This letter is the Annual Progress Report for our research program supported under
DARPA-ONR Contract N00014-82-K-0727.

During the period of July 1, 1984 to June 30, 1985, we have continued to make
progress on the acquisition of acoustic-phonetic and lexical knowledge. Specifically:

* We have completed the development of a continuous digit recognition system.
The system was constructed to investigate the use of acoustic-phonetic
knowledge in a speech recognition system. The significant achievements of
this study include the development of a "soft-failure" procedure for lexical
access and the discovery of a set of acoustic-phonetic features for verification.

* We have completed a study of the constraints that lexical stress imposes on
word recognition. We found that lexical stress information alone can, on
average, reduce the number of word candidates from a large dictionary by more
than 80 percent. In conjunction with this study, we successfully developed a
system that automatically determines the stress pattern of a word from the
acoustic signal.

* We have performed an acoustic study on the characteristics of nasal
consonants and nasalized vowels. We have also developed recognition algorithms
for nasal murmurs and nasalized vowels in continuous speech.


* We have finished the preliminary development of a system that aligns a speech
waveform with the corresponding phonetic transcription.


We  are  including  with  this  report  copies  of  all  publications,  in  the  form  of  theses 
and  papers  presented  at  various  conferences,  written  during  this  contracting  period. 


Enc. 

VWZ/kk 


Victor  W.  Zue 
Principal  Investigator 
Chapter Four: Performance Evaluation
probability of choosing a word from class 1

number of classes


The representation may be extendible to continuous-speech, speaker-independent
speech recognition. Such methods have motivated the implementation of a similar
representation in a speaker-independent, connected-digit recognizer by Chen [2].

There is strong evidence that prosodic information serves two major purposes for
recognition. First, lexical stress provides a significant amount of lexical
constraint that reduces the search space for large vocabulary lexical access. Second,
perceived stress to the parameters of fundamental frequency, peak envelope
amplitude, duration, and the integral of the amplitude were investigated. The
stressed syllable had a higher maximum fundamental frequency than the unstressed
of the same token in 90% of the cases, a higher peak amplitude in 87%, a longer
duration in 66%, and a higher integral of amplitude in 92%. The stressed syllable
fundamental frequency on the final syllable which could influence the decision of
an automated scheme. Words embedded in a carrier phrase would more than likely
show a more natural, gradual decrease or declination in fundamental frequency.

acoustic correlates likewise increases. Even though Klatt [11] reports only a 5%
duration difference between primary and secondary stressed vowels, the difference
approximately 350 polysyllabic words spoken by three female and four male
speakers. The fifty word corpus is listed in Appendix A. The majority of words are
listen to a recording of the words where each word was repeated three times. They
were asked to mark which syllable received the greatest stress for each word, or to
mark the word as ambiguous if no clear emphasis could be perceived. From the

surrounding sonorants. This study, then, also investigates the influence of stress on
the sonorant regions of the syllables in comparison to the vowel regions.

Each word in the data base is associated with its time-aligned phonetic
In summary, duration seems to be a valuable feature to incorporate into a stress
recognition system as it is relatively easy to measure and is an acoustic correlate of

1 Features in this thesis are defined to be acoustic attributes as opposed to distinctive features [JI].


2.1.3  Fundamental  Frequency 


covering the frequency range from 200 to 2500 Hz. With this method, developed by
Seneff [22], the waveform displays strong periodicity at the fundamental, allowing an
Average Magnitude Difference Function of the waveform and a voicing decision to
produce an accurate F0 contour. Similar to the energy measurements, the average
F0 is computed in each vowel and sonorant region of the word. The maximum

words. In his system, the speech signal is crudely segmented into broad classes.
Word candidates are proposed by matching the derived phonetic classification of
the signal against those stored in the 20000 word lexicon. As discussed in Chapter 1,
prosodic information in addition to segmental provides additional lexical constraint.
In particular, the segmental information around stressed syllables is more
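
As a rough illustration of the Average Magnitude Difference Function (AMDF) mentioned in this passage, the sketch below estimates F0 for a single voiced frame by picking the lag with the smallest AMDF value. It is a minimal sketch only: the function name `amdf_f0`, the 60-400 Hz search range, the 8 kHz sampling rate, and the synthetic test tone are all assumptions made for illustration, and the band-limiting and voicing decision described in the text are not included.

```python
import math

def amdf_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate F0 of one voiced frame from the lag that minimizes the AMDF.

    The search range (f_min, f_max) is an assumed, illustrative choice.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    span = len(frame) - lag_max  # number of sample pairs compared at every lag
    if span <= 0:
        raise ValueError("frame too short for the requested F0 range")
    best_lag, best_value = lag_min, float("inf")
    for lag in range(lag_min, lag_max + 1):
        # AMDF: average absolute difference between the frame and its shifted copy
        value = sum(abs(frame[n] - frame[n + lag]) for n in range(span)) / span
        if value < best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic 100 Hz tone (hypothetical test signal, not data from the study)
sample_rate = 8000
frame = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(400)]
f0 = amdf_f0(frame, sample_rate)
```

Averaging such per-frame estimates over a vowel or sonorant region would give the region-level F0 that the text describes.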
would benefit from prosodic information. Such a system proposed by Huttenlocher
[10] is a speaker independent, isolated word system with a vocabulary of 20000

2 Words from the lexicon that do not appear in the Brown Corpus are arbitrarily given a frequency
guidelines are used. The number of syllables equals the number of vowels and
syllabic consonants in the baseform. Primary stress, as marked in the baseforms,
establishes the stressed syllable. Schwas in the pronunciation are considered as

implementation. Therefore, further experiments are performed where less
knowledge is assumed. The original experiment forms a basis against which to
compare the constraining power of other classification criteria.


large lexicon. These results suggest that the location of primary stress is a practical
goal that may provide adequate lexical constraint.
speech signal as sonorant is much greater than individually labeling vowels, liquids,
glides, and nasals. Once the initial segmentation labels regions of the speech signal
that are sonorant, it may be necessary to examine these areas in greater detail for


Figure  3-1:  Stress  Recognition  System  Outline 
between syllables, such as Massachusetts. The words require no further
segmentation after the initial broad segmentation as each sonorant region
corresponds to a syllable. The next less difficult cases are intervocalic nasals.


An alternative to the above center of gravity measure is to choose a weight that is
sensitive at crucial points and, more importantly, incorporates better speech specific
knowledge to weight the spectrum. For example, in order to make a front-back

The sum of the weighted spectral components at each point in time yields a time
domain waveform that can be readily characterized by its transitions from positive
to negative or vice versa. In the more detailed segmentation, this is the means of
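
The idea of summing weighted spectral components into a contour and then reading off its sign transitions can be sketched as follows. This is an illustrative sketch under stated assumptions: the four-band spectra, the weight values, and the function names `weighted_contour` and `sign_changes` are hypothetical and are not the weighting functions used in the thesis.

```python
def weighted_contour(spectra, weights):
    """Sum the weighted spectral components of each frame into one value per frame."""
    return [sum(w * s for w, s in zip(weights, frame)) for frame in spectra]

def sign_changes(contour):
    """Return (frame_index, direction) for every positive/negative transition."""
    changes = []
    for t in range(1, len(contour)):
        if contour[t - 1] < 0 <= contour[t]:
            changes.append((t, "neg_to_pos"))
        elif contour[t - 1] >= 0 > contour[t]:
            changes.append((t, "pos_to_neg"))
    return changes

# Hypothetical weights: positive for low bands, negative for high bands
weights = [1.0, 1.0, -1.0, -1.0]
# Hypothetical magnitude spectra, one row per frame
spectra = [
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 3.0, 2.0, 2.0],
    [1.0, 1.0, 4.0, 5.0],
    [1.0, 2.0, 5.0, 4.0],
    [5.0, 5.0, 1.0, 0.0],
]
contour = weighted_contour(spectra, weights)
changes = sign_changes(contour)
```

With these toy values the contour is positive while energy is concentrated in the low bands and negative otherwise, so each sign change marks a candidate boundary.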
locate several classes of sounds, usually in an intervocalic context, to establish
syllable boundaries when consonants can not do the separating. These classes in
increasing order of difficulty are nasals, /l/ and /w/, vowel-vowel sequences, and
/r/. Characterizing energy and spectrally weighted contours allows transitions for


necessarily uniquely identify the syllables, the further segmentation attempts to


After the parameters are extracted, they are normalized. Each parameter has a
predetermined range which is linearly mapped into the normal range as shown in
Figure 3-11. The range for each parameter is chosen to cover all reasonable values
for that parameter. In other words, from the observation of over 1000 utterances
with the above parameters computed, there are no parameter values that fall outside
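
The linear mapping of each parameter from its predetermined range into a common range can be sketched as below. The target range of 0 to 1, the parameter names, and the numeric ranges are assumptions chosen for illustration; the actual ranges are those shown in the figure the text refers to.

```python
def normalize(value, low, high):
    """Linearly map value from [low, high] to [0, 1], clipping at the ends."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    return (clipped - low) / (high - low)

# Hypothetical per-parameter ranges (not taken from the thesis)
PARAM_RANGES = {"duration_ms": (0.0, 500.0), "f0_hz": (50.0, 400.0)}

raw = {"duration_ms": 125.0, "f0_hz": 225.0}
normalized = {name: normalize(v, *PARAM_RANGES[name]) for name, v in raw.items()}
```

Clipping is a defensive choice in this sketch; the text states that no observed values fell outside the chosen ranges.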
one within the word. A covariance matrix requires accumulating statistics over a
large number of words, similar to the clustering algorithm.

in the word such as the first syllable in the word Massachusetts.

After  the  stressed  syllables  are  assigned,  the  duration  and  energy  of  the  unstressed 
the location of stress are correct in 90% of the words.


rules by which to expand the lexicon. It is necessary to achieve a more elaborate
understanding of the relationship between the acoustic realizations and the
phonological rules that emulate them. There is a difficult trade-off between writing
rules that create all the variations observed and writing a set of coherent, general,
and justifiable rules. It is certainly a difficult problem that relies on expertise in

appropriate for sorting the data points, or syllable units, into bins of stress and
unstressed. The current algorithm provides a good means of feature extraction, but
for continuous speech the comparison of the feature vectors and the decision
process needs to be more generalized.
vectors within a word. In a continuous sentence of speech, there may be several
points of stress. Instead of picking the most stressed candidate, the algorithm would
have to choose the best n regions of stress. A technique such as clustering may be
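
Choosing the best n regions instead of a single winner can be sketched very simply, assuming each syllable already has one scalar stress score. This is not the clustering technique the passage suggests; it is only a minimal top-n selection, and the scores and the function name `best_n_stress_regions` are hypothetical.

```python
import heapq

def best_n_stress_regions(scores, n):
    """Return the indices of the n highest-scoring syllables, in time order."""
    top = heapq.nlargest(n, enumerate(scores), key=lambda item: item[1])
    return sorted(index for index, _ in top)

# Hypothetical per-syllable stress scores for one sentence
scores = [0.2, 0.9, 0.1, 0.4, 0.8, 0.3]
stressed = best_n_stress_regions(scores, 2)
```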
so that a new user is not required to train a system before using it. Furthermore, we
would  like  speech  recognition  systems  to  recognise  continuous  speech,  as  opposed 
to  isolated  words.  We  use  continuous  speech,  not  isolated  words,  when  we  speak; 
therefore,  a  continuous  speech  recognition  system  is  more  user-friendly.  Continuous 
speech  recognition  systems  have  the  added  advantage  that  users  could  enter  infor- 


Sound. When both a noise and voicing source are present, voiced consonants (e.g.,
/v/, ft/) are produced. Many speech scientists (e.g., Chomsky and Halle, 1968;
Jakobson, Fant, and Halle, 1952) have described speech sounds in terms of these
characteristics, that is, voiced or unvoiced characteristics, and other characteristics

sounds are allowed. Thus given a sequence of sounds, one can deduce whether or
not it could be a word in a specified language.

The examples in this section have briefly introduced some low level speech
characteristics. These characteristics can be organised as low level speech knowl-
was used in developing the front end descriptors for each of the phonetic labels and
in developing phonological rules. Language knowledge was used to find the best
path through the word lattice and in top down verification. However, the methods
used to incorporate speech knowledge into the system were not fully explored. For

1.2.7 Speech Knowledge in Recognition Systems

We have seen several approaches to speech recognition, each using varying
amounts of speech knowledge. Template matching techniques use constraints to
define a manageable task (e.g., speaker trained and isolated word tasks) and are


acoustic phonetic constraints may be important.

Each of these systems has contributed to our understanding of how to use speech
knowledge in speech recognition. However, we still need to understand better the
constraints provided by different types of speech knowledge, especially low-level

speech recognition task such as the digits, a recognition system can initially process
continuous speech at the more robust broad class level, rather than at the detailed
level of the benchmark systems. This philosophy was used in the development of
the recognition model.


for applying sequential constraints to natural speech using knowledge of front end
characteristics is developed. Finally, an investigation of the application of path,
allophonic, and durational constraints is presented.


described in the following sections.


constraints defining broad classes should be applied early in the recognition process
because these robust descriptors provide strong constraints in narrowing down the
word candidates. In addition, these constraints require little computation and no

candidates reduces the computation needed in further processing.

In a recognition system, sequential constraints can be applied at the broad
phonetic or detailed phonetic level to propose word candidates. Shipman and Zue's
2.3.4 Knowledge of Allophonic Variation

Depending upon the context of a phoneme, many different realizations of that

pronunciations of "eight" and the context, if required, for each pronunciation are
shown in the first two columns of Table 2.1. Note that "eight" may be pronounced
with a released or unreleased /t/ in any environment; but "eight" is pronounced
with a flapped /t/ only when the following word begins with a vowel. Thus an
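
The context dependence described for "eight" can be written down as a small table of pronunciations with an optional required context. The sketch below is only an illustration of that idea: the data layout, the context label `"vowel"`, and the function name `allowed_pronunciations` are assumptions, not the representation used in the thesis.

```python
# Each entry: (realization of the final /t/, required following context or None)
EIGHT_PRONUNCIATIONS = [
    ("released /t/", None),      # allowed in any environment
    ("unreleased /t/", None),    # allowed in any environment
    ("flapped /t/", "vowel"),    # only when the next word begins with a vowel
]

def allowed_pronunciations(next_word_starts_with_vowel):
    """Return the realizations of 'eight' permitted in the given context."""
    allowed = []
    for realization, required_context in EIGHT_PRONUNCIATIONS:
        if required_context is None:
            allowed.append(realization)
        elif required_context == "vowel" and next_word_starts_with_vowel:
            allowed.append(realization)
    return allowed

before_vowel = allowed_pronunciations(True)
before_consonant = allowed_pronunciations(False)
```

A word hypothesis that relies on the flapped realization could then be ruled out whenever the following word does not begin with a vowel.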


1 The release of a /t/ actually consists of a burst followed by aspiration. In this thesis, the broad
class of "fricative" was generalized to include aspiration in addition to fricative sounds.

makes its output potentially robust with regard to speaker variability. The broad
class segmentation is then input into the lexical component where the phonetic tran-


Note that the low-frequency energy (energy 125-750 Hz) is highest in vowel and
sonorant regions. This is because F1 (and possibly a nasal formant) is present
during the production of vowels and voiced sonorants. Thus low-frequency energy
is a good indicator of voiced regions.
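
One way to compute a low-frequency energy measure of this kind is to sum squared DFT magnitudes over the 125-750 Hz band. The sketch below does this with a naive DFT on a single frame. The 8 kHz sampling rate, the 256-sample frame, the test tones, and the function name `band_energy` are assumptions for illustration; the original front end is not specified in this excerpt.

```python
import cmath
import math

def band_energy(frame, sample_rate, f_low=125.0, f_high=750.0):
    """Sum squared DFT magnitudes for bins whose frequency lies in [f_low, f_high]."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_low <= freq <= f_high:
            # Naive DFT coefficient for bin k
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(frame))
            energy += abs(coeff) ** 2
    return energy

sample_rate = 8000   # assumed sampling rate
n = 256              # assumed frame length
low_tone = [math.sin(2 * math.pi * 500 * i / sample_rate) for i in range(n)]
high_tone = [math.sin(2 * math.pi * 3000 * i / sample_rate) for i in range(n)]
low_tone_energy = band_energy(low_tone, sample_rate)
high_tone_energy = band_energy(high_tone, sample_rate)
```

A 500 Hz tone falls inside the band and yields a large value, while a 3000 Hz tone contributes almost nothing, which mirrors the contrast between voiced sonorant regions and high-frequency frication.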


The high low regions were found using an algorithm which depends on two
thresholds, T1 and T2, to locate a region and then define the edges of a region. By

$T_1 = c_1 (\max - \min) + \min,$
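
A two-threshold region finder of the kind described here can be sketched as follows. The form of T1 follows the equation in the text; defining T2 the same way with a second constant `c2`, the particular constants, and the rule of extending a region outward while the contour stays above T2 are assumptions made for this illustration.

```python
def find_regions(contour, c1=0.6, c2=0.2):
    """Locate regions with threshold T1, then extend their edges down to T2.

    Returns a list of (start, end) frame indices, inclusive.
    """
    lo, hi = min(contour), max(contour)
    t1 = c1 * (hi - lo) + lo   # threshold used to locate a region
    t2 = c2 * (hi - lo) + lo   # assumed analogous threshold used to place edges
    regions = []
    i, n = 0, len(contour)
    while i < n:
        if contour[i] > t1:
            start = i
            while start > 0 and contour[start - 1] > t2:
                start -= 1
            end = i
            while end + 1 < n and contour[end + 1] > t2:
                end += 1
            regions.append((start, end))
            i = end + 1
        else:
            i += 1
    return regions

# Hypothetical energy contour, one value per frame
contour = [0, 1, 2, 8, 9, 8, 2, 1, 0, 0, 7, 9, 3, 0]
regions = find_regions(contour)
```

Using a high threshold to find a region and a lower one to set its edges keeps small dips from splitting a region while still giving boundaries close to the true onset and offset.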


for smoothing because it preserves the edges of the contour while smoothing small
irregularities (compare (a) and (c)), resulting in a cleaner first difference of the
linearized parameter (d). Because onsets in the low frequency energy contour are
sharper and can be more robustly detected than offsets, they are located where the

3.1.3 Broad Phonetic Labeling

Broad phonetic labeling uses a set of production rules to deduce possible broad
classes from the chosen set of acoustic features. The hypothesized broad classes


The first set of 12 production rules hypothesizes each segment to be zero or
more phone-like classes, based upon the presence or absence of combinations of
non-conflicting robust acoustic features characterising each segment. The acoustic
features used are shown in the top of Table 3.3, and the phone-like classes used are




The likelihood ratio of labels » and is defined to be the product of the likelihood


automatic broad class transcription were aligned using a simple 50% overlap criterion:
if segment A_i in string A covered over half the duration of segment B_j in string
B, then segment A_i was mapped with segment B_j. An overlap criterion, rather
than a string alignment, was chosen because the time boundaries associated with

for combinations of utterances and speakers reveals the performance to be similar.
In cases where the labels differ, there are usually few samples, since these phones
do not normally occur in digits. For example, voiced /h/ was sometimes used to
mark aspiration at the end of sentence.
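
The 50% overlap criterion can be sketched directly from its description: a segment in string A is mapped to a segment in string B when their overlap exceeds half of the B segment's duration. The segment labels, the times (in milliseconds), and the function names below are hypothetical examples, not data from the study.

```python
def overlap(seg_a, seg_b):
    """Duration shared by two (label, start, end) segments."""
    return max(0, min(seg_a[2], seg_b[2]) - max(seg_a[1], seg_b[1]))

def align_by_overlap(string_a, string_b):
    """Pair A_i with B_j when A_i covers over half the duration of B_j."""
    pairs = []
    for seg_a in string_a:
        for seg_b in string_b:
            if overlap(seg_a, seg_b) > 0.5 * (seg_b[2] - seg_b[1]):
                pairs.append((seg_a[0], seg_b[0]))
    return pairs

# Hypothetical transcriptions: (label, start_ms, end_ms)
string_a = [("vowel", 0, 200), ("nasal", 200, 300), ("stop", 300, 400)]
string_b = [("sonorant", 0, 280), ("silence", 280, 410)]
pairs = align_by_overlap(string_a, string_b)
```

In this toy case the nasal is left unmapped because it covers less than half of either B segment, which is the kind of case a time-based criterion handles without forcing a label-by-label string alignment.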


allowing for aspiration following the final /r/ in 'four' and for deletion of the final
closure in 'eight' were used. Each phonetic pronunciation of a word is stored
in an association list which is keyed by phonetic transcription. Associated with
each pronunciation is a structure containing broad contextual information. This


phonetic segmentation produced by the broad phonetic classifier and knowledge about
the words which form the lexicon. In the ideal case where interspeaker and
intraspeaker variations are minimal and the broad class segmentation is accurate,
sequential constraints can be applied directly to the segmentation string. That is,

may have never been observed to be pronounced this way. If a speaker then says
the /f/ in 'four' with a short period of silence in the middle, the system should
use the knowledge that it has seen /f/'s in other words pronounced this way. Thus
the system should give the /f/ a good score, rather than commit a fatal error by


Note that an insertion or deletion occurs 1% of the time in the first transition only.
The total accumulated score to a phone and label pair is shown under 'total.' The
score assigned to a phonetic string is the total score of the best path. This score is
normalized by the number of transitions and is shown as the final score in the figure.

distributions are similar for the new utterances spoken by training speakers and by
new speakers, indicating the potential speaker independence of the approach.

A word score threshold can then be set such that all words with a score below
the threshold are ruled out as a viable candidate. If a word is pruned as soon as


training utterances by training speakers (b) new utterances by training speakers (c)
training utterances by new speakers (d) new utterances by new speakers

Three types of constraints were applied following word hypothesis: path
constraints, durational constraints, and allophonic constraints. The block diagram in
Figure 3.15 illustrates when each constraint is applied. For example, durational
constraints are applied first to rule out word candidates which depend on a seg


Path constraints, as described in Section 2.3.2, require that each word in the
lattice form part of a complete path. Words which do not have a legal "next word"
and "previous word" are pruned from the lattice. The "next word" can be either a
word which begins where the current word ends, the end of the sentence, or a word
that has an initial phone which could acoustically geminate with the final phone of
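
Path-constraint pruning can be sketched as repeatedly removing lattice words that lack a legal previous word or a legal next word until nothing changes. The sketch below is a simplified illustration: it treats words as hypothetical `(label, start, end)` tuples on a frame grid, requires exact boundary matches, and leaves out the gemination case mentioned in the text.

```python
def prune_lattice(words, sentence_start, sentence_end):
    """Keep only words that can lie on a path spanning the whole sentence.

    words: list of (label, start, end) with start < end.
    """
    lattice = list(words)
    changed = True
    while changed:
        starts = {w[1] for w in lattice}
        ends = {w[2] for w in lattice}
        kept = [
            w for w in lattice
            # legal previous word: sentence start, or some word ends where w begins
            if (w[1] == sentence_start or w[1] in ends)
            # legal next word: sentence end, or some word begins where w ends
            and (w[2] == sentence_end or w[2] in starts)
        ]
        changed = len(kept) != len(lattice)
        lattice = kept
    return lattice

# Hypothetical word lattice for a digit string
words = [("one", 0, 3), ("nine", 0, 4), ("two", 3, 6), ("eight", 5, 8), ("five", 6, 10)]
pruned = prune_lattice(words, sentence_start=0, sentence_end=10)
```

Iterating to a fixed point matters because removing one word can leave a neighbour without a legal successor or predecessor.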


Table  3.4:  Cutoff  Points  of  Segment  Duration 


The broad phonetic classifier segments and labels speech into broad phonetic
classes using a set of production rules applied to coarse acoustic features.
The broad acoustic features characterising the speech signal are defined by
identifying robust regions and then extending outward.


phones were chosen as the basic recognition unit because they are suitable for
defining many types of phonetic features and because of potential extendability to
other recognition tasks. By representing each of the word hypotheses as a sequence
of phones, features can be developed to characterize phones rather than whole
upon robust information. In contrast, a non-acoustic-phonetically based preprocessor
attempts to screen word candidates primarily on detailed spectral information.
The purpose of the preprocessor is to rule out unlikely word candidates; attention
to fine phonetic differences at an early point in processing is not only unnecessary
but is also not as robust. Fine phonetic differences are not as robust because fine

because the spectral representation has not been abstracted to capture the
speaker-independent information. DTW and VQ attempt to handle speaker-independence
with use of multiple templates. In a network model, multiple paths may be needed
to represent different types of speakers. Thus each approach attempts to achieve
speaker-independence by capturing variations in sounds, rather than abstracting
5.7 Computational Considerations

Recognition systems are currently limited in part by the amount of computa-

In an acoustic-phonetic approach to continuous speech recognition, the phone
endpoints are located before the computation of recognition scores. Thus
recognition scores are computed for only one set of endpoints. This approach can be
contrasted with template-based approaches which try to find the best set of end


word hypotheses. To extend the model to continuous speech, broad class sequential
constraints are used to hypothesize words and also to hypothesize corresponding
word boundaries. Performing word hypothesis from a broad phonetic segmentation
is in contrast to earlier continuous speech recognition systems, such as HWIM and

using a set of production rules to produce a broad phonetic segmentation.
Implementation also demonstrated an alternative to earlier segmentation algorithms.
Rather than assigning labels to each frame and then grouping the labels to form
segments, robust regions, similar to islands of reliability, were identified and
level speech knowledge: Characteristics about speech derived from the
preceding a nasal consonant [21]. The amount of coarticulated nasalisation
depends upon the particular language and dialect. Since anticipatory nasalisation
is common in American English [27], a sequence of a vowel plus a nasal consonant
(VN) may, in many situations, be pronounced as a simple nasalised vowel, or a
nasalised vowel plus a short, residual nasal murmur. This is especially true of

some of this work in order to put the acoustic study of this research into better
perspective.

Analysis and Synthesis Studies


performances have the potential to be substantially better than are presently
obtained in practice.




A study using naturally spoken words, however, provides greater insight into the
acoustic characteristics of sounds in fluent speech. Also, any quantified




consonant durations in many different environments.


2.2.3 Analysis of Spectra

Perhaps the most interesting aspect of the acoustic study is the study of the
spectral characteristics of the nasal consonants and the nasalised vowels. The
spectral analysis performed in this research is carried out in two steps. In the first

Fourier Transform (DFT). The spectra were computed every 5 msec, and were
smoothed by windowing the cepstra with a low-pass window that is constant for
the first 1.5 msec, and cosine tapered for the next 1.5 msec. Figure 2.2 illustrates
an unsmoothed and a smoothed version of a DFT, taken from a nasalized /i/ in
the word technique. A discussion on the issues involved in spectral analysis may
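
The low-pass cepstral window described here (constant for 1.5 msec, then cosine tapered for a further 1.5 msec) can be sketched as a list of weights over quefrency samples. Only the window itself is shown; the 16 kHz sampling rate, the raised-cosine form of the taper, and the function name `cepstral_lifter` are assumptions for illustration, since the excerpt does not give them.

```python
import math

def cepstral_lifter(num_coeffs, sample_rate, flat_ms=1.5, taper_ms=1.5):
    """Low-pass cepstral window: 1.0 for flat_ms, raised-cosine taper for taper_ms, then 0."""
    flat = int(round(flat_ms * sample_rate / 1000.0))
    taper = int(round(taper_ms * sample_rate / 1000.0))
    window = []
    for n in range(num_coeffs):
        if n < flat:
            window.append(1.0)
        elif n < flat + taper:
            # Falls smoothly from 1 toward 0 across the taper
            window.append(0.5 * (1.0 + math.cos(math.pi * (n - flat) / taper)))
        else:
            window.append(0.0)
    return window

# Assumed 16 kHz sampling rate: 1.5 msec corresponds to 24 quefrency samples
lifter = cepstral_lifter(64, 16000)
```

Multiplying the low-quefrency cepstral coefficients by these weights (and the mirrored high-index coefficients by the same weights, for a symmetric real cepstrum) before transforming back would give a smoothed spectrum.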


context of interest, SpireX searches through the catalog for instances of the
desired phonetic context. Each such instance is known as a sample. The phonetic
context is specified as a sequence of named regions, each of which consists of a
given phonetic pattern. For example, a region could specify a class of phonemes, a
specific phoneme, or a more complicated pattern. Thus, to collect a sample set of

statistical analysis. Thus, for the nasal stop example, the voicing computation
could be used to filter the sample set. This would allow the user to separate the
statistics of the nasal, and vowel duration computations of voiced stops from those
of voiceless stops.




Data analysis is performed using the SpireX statistical analysis facility.


established for distinguishing voicing in




Figure  3.3:  Voicing  Discrimination  in  Stop  Nasal  Consonant  Sequences 


The first energy experiment conducted measured an energy difference, calculated
by subtracting the average total energy in the nasal murmur from the average
total energy in the adjacent sonorant. Figure 3.10 contains a histogram of this
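
The energy difference measure amounts to subtracting the mean total energy of the nasal murmur from that of the adjacent sonorant. A minimal sketch follows; the per-frame values (taken here to be in dB) and the function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def energy_difference(sonorant_energy, murmur_energy):
    """Average sonorant energy minus average nasal murmur energy."""
    return mean(sonorant_energy) - mean(murmur_energy)

# Hypothetical per-frame total energies (assumed to be in dB)
sonorant_frames = [60.0, 62.0, 64.0]
murmur_frames = [50.0, 52.0]
difference = energy_difference(sonorant_frames, murmur_frames)
```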


the  energy  of  medial  nasals  is  quite  stt.« 


average, or a spectral weighting function, such as the center of mass. Figure 3.21
displays a histogram of the average deviation of the normalised low frequency
energy (below 1000 Hz). The distribution of voice bars is also displayed for
comparison. Figure 3.27 presents a statistical summary of this measurement for
similar sounds.



Largest Spectral Peak in the Nasal Consonant


This display summarizes the low resonance percentage of nasal consonants and similar
sounds. From left to right, they are: all nasal consonants (N), liquids and glides (G), and
voice bars (VB). The average value is indicated by a filled circle. The vertical lines indicate
one standard deviation, and the open circles display the maximum and minimum values.
The number of samples in each context are indicated below the display.

This display summarizes the low resonance amplitude of nasal consonants and similar
sounds. From left to right, they are: all nasal consonants (N), liquids and glides (G),
and voice bars (VB). The average value is indicated by a filled circle. The vertical lines
indicate one standard deviation, and the open circles display the maximum and minimum
values. The number of samples in each context are indicated below the display.




A study of nasalized vowels is more complicated than a study of nasal consonants.


characteristics of nasalized vowels.


3.2.1  A  Study  of  Nasalized  Vowel  Duration 


significant than that observed in the nasal consonants in the same circumstances.

This display summarizes the difference in vowel duration of minimal pairs in different voicing
contexts. From left to right they are: all nasal consonant clusters (NC), nasal stop clusters
(NS), nasal fricative clusters (NF). The average value is indicated by a filled circle. The
vertical lines indicate one standard deviation, and open circles display the maximum and
minimum values. The number of samples in each context are indicated below the display.


resonance was found to be the best indication of nasalization.

In summary then, by observation of spectra, a set of qualitative characteristics of
vowel nasalization was proposed. Due to the variability of the environment, none

Figure 3.34: Overlay of Nasalized and Non-nasalized /!/

This display contains spectra of the vowel /!/ taken from the words W. and f«*n.f«e
The light line is for the vowel in the non-nasalized context.


This display indicates relative differences in standard deviation between nasalized vowels
and their non-nasalized counterparts. The horizontal coordinate of a vowel is its value in a
nasal context. The vertical coordinate of the vowel is its value in a similar, but non-nasal,
context.


This display indicates relative differences in minimum percentage between nasalized vowels
and their non-nasalized counterparts. The horizontal coordinate of a vowel is its value in a
nasal context. The vertical coordinate of the vowel is its value in a similar, but non-nasal,
context.

This display indicates relative differences in resonance difference between nasalized vowels
and their non-nasalized counterparts. The horizontal coordinate of a vowel is its value in a
nasal context. The vertical coordinate of the vowel is its value in a similar, but non-nasal,
context.


Once minimal pair experiments had established some relative results, distributions
were made for all of the vowels. It was found useful to retain a high-low
distinction in the distributions however, since low vowels tended to have a more

The statistical distributions of the measure of difference, shown in figure 3.47, are
perhaps the most difficult to interpret since they appear to overlap. The idea
behind this measure was that as the extra resonance became stronger, the


The  variability  of  vowel  spectral  shapes  hindered  the  study  of  nasalization.  Since 
the  main  area  of  interest  was  in  the  first  resonance  region,  the  difficulties  lay  with 


strength, properties which were quantified by the measure of low resonance
percentage, and low resonance height, respectively.


(non-nasalized). Note that the evaluation procedure of the nasalized vowel
detection system is not a true judge of nasalization, since some vowels in a
non-nasal context will be nasalized, while some vowels in a nasal context will
hardly be nasalized at all. A better evaluation measure would be to compare


Table [number illegible]: Nasalized Vowel Detection

As was the case for the nasal consonants, an initial analysis was performed to
determine which of the three methods performed the best. Once again for




of standard deviation. The acoustic study also established that it is possible to
discern relative degrees of nasalization by measuring the relative strength of the
extra resonance to the first resonance, and by measuring the amount of time that
it is present in the vowel.




[11] Dixon, N.R., Silverman, H.F., "A General Language Operated Decision
Implementation System (GLODIS): Its Application to Continuous-Speech
Segmentation," IEEE Transactions ASSP, Vol. 24, pp. 137-162, 1976.

[12] Fairbanks, G., House, A.S., Stevens, A.L., "An Experimental Study of Vowel
Intensities," Journal of the Acoustical Society of America, Vol. 22, No. 4, pp.

The most difficult boundaries to establish were between nasal consonants and


Figure B.2: The Transcription of the word unmet
Figure B.3: The Transcription of the word smack
Figure B.4: The Transcription of the word warms
Figure B.5: The Transcription of the word [illegible]
Figure B.6: The Transcription of the word [illegible]
Figure B.7: The Transcription of the word [illegible]


contains DFT spectra for two different duration Hamming windows

displays contain the Hamming windows and the original speech waveform.
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Whenever  there  were  two  resonances  in  the  spectrum,  the  resonance  difference  was 
calculated by measuring the difference, in dB, between the second resonance and
the  first  resonance.  Note  that  no  attempt  was  made  to  determine  which  resonance 
was  the  nasal  resonance.  Thus  for  high  vowels,  the  resonance  difference  was 
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Introduction 


acoustic speech signal is a very difficult, if not impossible, goal to achieve. A more
realistic goal is to locate the acoustic landmarks that correspond to the phonetic
transcription consistently and robustly. This is the objective of the thesis.

example, the transition from a vowel to a post-vocalic /r/ as in the word 'dark' is
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the  9ion|  nicalivw  has  been  located,  the  time  span  of  the  Am  live  phonetic  cvenu 


exploited with a limited vocabulary size.

Therefore, they are both assigned a 'sonorant' label in the ideal segmentation.
Furthermore, adjacent segments with the same label are collapsed into one long
segment with the same label. Thus, although the segmentation is ideal, it retains the


have an acoustic realization very different from an obstruent. Nevertheless, this


context-independent the segmentation


phonetic contexts.

After the alignment is performed, the time marks of the phonetic transcriptions,
which are obtained from the ideal segmentation, are then used to check if the

- A feasibility study illustrates that a sequence of broad manner classes
can provide enough constraints to time align an utterance with its
corresponding phonetic transcription.


unit. These two problems are made more difficult due to acoustic differences
between speakers, and to coarticulation. In some cases, the inter-speaker acoustic

feature space, then the class i region in the feature space is the set of points for
which d(i,x) < d(j,x), for all j not equal to i. As can be seen in Figure [number illegible], if a single


nonlinear characteristic can be approximated by a set of piecewise linear functions


3.2.2.1 Feature Extraction and Smoothing

Figure 10: The features used at the nodes of the classification tree.


3.2.2.3 A 2-means Clustering Algorithm

Every 5 ms, an M-dimensional feature vector is obtained. All the feature vectors


however, these initial centroids can be selected intelligently, since the general
properties of the samples in the M-dimensional space are known. For example,

marked by the triangles, clustering algorithm with Euclidean
distance metric may converge and draw the decision boundary shown by the


some feature distributions have a small number of outliers and thus resulting in

spectrogram. The labels of each of the


"insertion error". Although no further segmentation is needed, this kind of

Two path finding strategies have been investigated. One is based on the principle
of dynamic programming and the other is based on branch and bound (Winston [year illegible]).
The principle of dynamic programming is as follows. If a state can be reached from
the start node in different ways, then the state is only associated with the partial path
that has the smallest cost. All the other partial paths to that state would be

via 2 different paths. However, each one is associated with a different cost. The use
of dynamic programming will discard the one that has a higher cost of 7 and only
keep the one that has a lower cost of 1. Unfortunately, the one with the lower cost
can never reach the goal node G.
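
The pruning problem described above can be illustrated with a small sketch. The graph, the edge costs other than the 7 and the 1 quoted in the text, and the rule that blocks the cheap path are all hypothetical: here we simply assume that a path may enter G only after it has consumed a fixed number of transcription labels, which the text does not state. Under that assumption, keeping one partial path per node loses the only viable route, while a best-first branch and bound over whole partial paths still finds it.

```python
import heapq

# Hypothetical toy graph: EDGES[node] = [(next_node, step_cost, labels_consumed)]
EDGES = {
    "S": [("X", 1, 1), ("A", 3, 1)],
    "A": [("X", 4, 1)],   # S -> A -> X costs 7 and consumes 2 labels
    "X": [("G", 2, 0)],
    "G": [],
}
REQUIRED_LABELS = 2  # assumed constraint: G accepts only paths that used 2 labels


def successors(node, cost, labels):
    """Yield (next_node, total_cost, total_labels) for every legal extension."""
    for nxt, step, used in EDGES[node]:
        if nxt == "G" and labels + used != REQUIRED_LABELS:
            continue  # this partial path can never reach the goal
        yield nxt, cost + step, labels + used


def dp_search(start="S", goal="G"):
    """Dynamic programming with the node alone as state: one path per node."""
    best = {start: (0, 0)}  # node -> (cost, labels) of the cheapest partial path
    frontier = [start]
    while frontier:
        node = frontier.pop(0)
        cost, labels = best[node]
        for nxt, new_cost, new_labels in successors(node, cost, labels):
            if nxt not in best or new_cost < best[nxt][0]:
                best[nxt] = (new_cost, new_labels)
                frontier.append(nxt)
    return best[goal][0] if goal in best else None


def branch_and_bound(start="S", goal="G"):
    """Best-first expansion of whole partial paths; nothing is merged by node."""
    queue = [(0, start, 0)]  # (cost, node, labels)
    while queue:
        cost, node, labels = heapq.heappop(queue)
        if node == goal:
            return cost
        for nxt, new_cost, new_labels in successors(node, cost, labels):
            heapq.heappush(queue, (new_cost, nxt, new_labels))
    return None


print(dp_search())         # None: the cost-7 path to X was discarded
print(branch_and_bound())  # 9: the cost-7 path to X is extended to G
```

In this toy setting dynamic programming would also succeed if the state were the pair (node, labels consumed) instead of the node alone; the sketch only shows what goes wrong when the state is too coarse.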
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principle  of  dynamic  programming.  whereas  the  second  trace  shows  die  result  after 


are at the zero crossings of the filtered feature parameter.


Acoustic landmarks between phonetic events. Some landmarks are evidenced by
distinct acoustic cues, whereas others are more subtle and even hand alignment
cannot locate the landmarks reliably. Thus it is important to evaluate the automatic

sentences, were manually labeled by a second acoustic phonetician. The cumulative
distribution of the boundary offset between the two phoneticians is shown by curve
B. Since it is difficult to say exactly where a boundary should be, this curve gives


Figure [number illegible]: A table showing mapping of the phonetic transitions into
"hard" and "soft" categories.

transcriptions must also be entered into the system by an experienced acoustic
phonetician. After the automatic alignment is performed, the output can be
checked by the acoustic phonetician and the acoustic landmarks can be adjusted, if

the corresponding dashed line.


- The young girl gave no clear response.
- A tusk is used to make costly gifts.
- The fin was sharp and cut the clear water.


Shipman, D.W. and Zue, V.W.
Properties of Large Lexicons: Implications for Advanced Isolated Word
Recognition Systems.
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The  study  reported  in  this  paper  is  concerned  with  the  determination  of  lexical  stress  for 
isolated  words  from  the  acoustic  signal.  It  is  motivated  by  two  observations.  First,  it  has 
long  been  suggested  that  stressed  syllables  represent  islands  of  reliability  where  the  acoustic 
cues  for  phonetic  segments  are  much  more  robust.  Evidence  for  this  observation  has  come 
from  diverse  sources.  For  example,  phoneme-monitoring  experiments  for  human  speech 
perception  have  shown  that  reaction  time  is  shorter  for  sounds  in  stressed  syllables  than  for 
those  in  unstressed  syllables.  Analysis  of  human  spectrogram  reading  results  also  indicates 
that  accuracy  is  higher  for  sounds  around  stressed  syllables.  In  addition,  automatic  speech 
recognition  "front-ends"  typically  recognize  phonemes  around  stressed  syllables  more 
accurately,  again  suggesting  that  the  acoustic  cues  in  these  regions  are  more  reliable.  As  an 
illustration,  consider  the  spectrograms  for  the  word  pair  CONtract/conTRACT  shown  in 
this  figure.  We  can  see  that  the  characteristics  for  vowels  and  consonants  around  stressed 
syllables  are  more  distinct. 

A  second  reason  for  investigating  lexical  stress  stems  from  the  results  of  a  set  of 
experiments,  suggesting  that  there  are  strong  constraints  on  the  allowable  sound  patterns  in 
the  English  language.  These  studies  have  shown  that,  when  words  in  a  lexicon  are 
represented  in  terms  of  broad  manner  classes,  the  number  of  words  sharing  the  same 
broad-class  pattern  is  often  very  small.  In  fact,  when  the  broad  class  representation  is 
augmented  with  stress  pattern  information,  the  number  of  word  candidates  for  a  given 
representation is further reduced. As an example, the two words "campus" and "compose"
shown in this figure have the same broad class representation [STOP, VOWEL, NASAL,
STOP, VOWEL, STRONG FRICATIVE], but they can be differentiated by the stress
patterns.


Another  finding  of  the  studies  we  just  cited  is  that,  even  if  all  the  phonemes  can  be 
determined  without  error,  stressed  syllables  still  provide  more  lexical  constraints  than  the 
unstressed  ones.  All  these  results  seem  to  suggest  that  determination  of  the  stress  pattern  of 
a word is potentially useful for speech recognition. We should emphasize at the outset that
we  are  only  interested  in  lexical  stress,  namely  the  stress  pattern  of  words  spoken  in 
isolation.  Sentential  stress  adds  another  level  of  complexity  to  the  problem,  and  is  not 
being  addressed  in  our  study. 

Our  present  study  focuses  on  two  related  issues.  First,  considering  prosodic  information 
as  a  separate  source  of  knowledge,  we  investigate  the  amount  of  lexical  constraint  provided 
by stress information alone. In other words, we want to know by how much the
number of word candidates can be reduced if only the stress patterns of words are known.
Second,  we  implement  a  system  that  derives  the  stress  information  of  a  word,  based  on  a  set 
of  measurements  made  from  the  acoustic  signal. 

In  order  to  determine  the  lexical  constraints  provided  by  stress  information,  we  created  a 
lexicon from the Merriam-Webster Pocket Dictionary, consisting of all the two-, three-, and
four-syllable  words.  The  words  in  the  lexicon  were  then  mapped  into  their  corresponding 
stress  pattern  classes.  For  this  study,  we  allowed  each  word  to  have  only  one  pronunciation 
and  one  stress  pattern.  We  adopted  a  three-level  convention,  stressed,  unstressed,  and 
reduced,  where  a  word  must  have  one  and  only  one  stressed  syllable.  The  results  of  the 
study  are  summarized  in  this  figure.  Looking  at  the  left-hand  column  of  numbers,  we  see 
that knowledge of the number of syllables of the words will give an expected class size
equal to approximately 37% of the size of the lexicon. When the stress pattern is
completely known, that is, the correct number of syllables and the correct assignment of
stress for each syllable, as shown in the middle column, the expected class size is reduced to
19%. In other words, we can expect to reduce the word candidates by a factor of five from
stress information alone. Comparing the middle column to the right-hand one, it is
interesting to note that knowledge of just the number of syllables and the location of the
most-stressed syllable provides about as much constraint as the entire stress pattern.
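
To make the notion of expected class size concrete, here is a minimal sketch. It assumes that every word in the lexicon is equally likely to be spoken, so a word falls into a class of size n with probability n/N and the expected class size, as a fraction of the lexicon, is the sum of (n/N)^2 over all classes. That definition, the six-word toy lexicon, and its stress labels (S for stressed, U for unstressed, R for reduced) are illustrative assumptions and are not taken from the study.

```python
from collections import Counter


def expected_class_fraction(labels):
    """Expected class size as a fraction of the lexicon.

    labels holds one class label per word. A word lands in a class of size n
    with probability n / N, so the expectation is sum((n / N) ** 2).
    """
    counts = Counter(labels)
    total = len(labels)
    return sum((n / total) ** 2 for n in counts.values())


# Hypothetical toy lexicon: word -> stress pattern (illustrative labels only).
TOY_LEXICON = {
    "campus": "SR",
    "compose": "RS",
    "yellow": "SU",
    "create": "US",
    "contract": "SU",
    "banana": "RSR",
}

patterns = list(TOY_LEXICON.values())
by_syllable_count = expected_class_fraction([len(p) for p in patterns])
by_stress_pattern = expected_class_fraction(patterns)

print(round(by_syllable_count, 3))  # 0.722
print(round(by_stress_pattern, 3))  # 0.222
```

As in the dictionary study, knowing the full stress pattern gives a smaller expected class than knowing the syllable count alone, although the toy numbers themselves carry no meaning.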

Having  determined  the  constraining  power  provided  by  stress  information,  we 
proceeded  to  develop  an  algorithm  to  automatically  derive  the  stress  pattern  from  the 
acoustic signal. We note that the acoustic correlates of stress have been studied extensively
by many researchers in the past; some of these studies are shown in this figure. Most of the
studies  have  found  that  lexical  stress  is  well  correlated  with  the  duration,  the  fundamental 
frequency  value,  and  the  intensity  of  the  syllable  nuclei.  Lieberman  in  fact  developed  a 
system  to  automatically  determine  the  stress  pattern  of  bi-syllabic  words.  Prior  to  the 
actual  development  of  the  stress  determination  system,  we  also  studied  the  acoustic 
correlates  of  stress  based  on  a  database  of  350  words,  spoken  by  7  talkers,  3  male  and  4 
female.  Our  measurements  generally  agree  with  those  found  by  previous  researchers. 
There  are  two  observations  worth  noting.  First,  we  found  that  prepausal  lengthening  is  an 
effect  that  must  be  properly  compensated.  Second,  the  measurements  were  nearly  as 
effective  in  separating  stressed  syllables  from  unstressed  ones  if  sonorants  adjacent  to  the 
vowels  were  also  included.  This  second  observation  is  important,  since  it  is  sometimes 
difficult  to  delineate  sonorants  from  adjacent  vowels  automatically. 


The structure of the system that we have developed for stress determination is shown in
the next figure. The input to the system is the acoustic signal, digitized at 16 kHz. The
system  has  two  main  components.  The  first  is  a  syllable  detection  component  which 
establishes  the  sonorant  regions  of  the  syllables.  Next,  the  stress  algorithm  examines  the 
syllables  within  the  word  and  derives  a  stress  pattern.  Thus,  each  derived  syllable  unit  is 
labeled  as  stressed,  unstressed,  or  reduced. 

The  next  two  figures  discuss  the  system  components  in  more  detail.  Syllable  detection  is 
accomplished  in  two  stages.  There  is  an  initial  segmentation  which  provides  a  description 
of  the  signal  in  terms  of  broad  classes,  such  as  sonorants,  obstruents,  etc.  This  classifier, 
developed  by  Leung  and  Zue,  uses  a  number  of  speech  parameters  to  classify  the  signal  on 
a  frame  by  frame  basis.  The  classifier  uses  a  number  of  sub-classifiers,  arranged  in  a  binary 
decision  tree.  At  each  node  in  the  tree,  a  k-means  clustering  algorithm  classifies  each  frame 
into  one  of  two  categories  based  on  a  specially  selected  feature  set.  Once  sonorant  regions 
are  established,  the  second  stage  of  the  syllable  detection  further  examines  these  regions  for 
possible  syllable  boundaries,  such  as  vowel/vowel  or  vowel/sonorant. 
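
As an illustration of the classification performed at a single node of the tree, the following is a minimal sketch of a 2-means step. The two features (a low-frequency energy in dB and a zero-crossing rate), the frame values, and the initial centroids are hypothetical; the actual feature sets and the tree layout of the Leung and Zue classifier are not given here.

```python
import math


def two_means(frames, centroids, max_iterations=20):
    """Split feature vectors into two clusters with k-means (k = 2).

    frames: list of equal-length feature tuples, one per analysis frame.
    centroids: two starting points, assumed to be chosen from prior knowledge.
    Returns (labels, centroids), where labels[i] is 0 or 1.
    """
    centroids = [list(c) for c in centroids]
    labels = [0] * len(frames)
    for _ in range(max_iterations):
        # Assignment step: each frame goes to the nearer centroid.
        labels = [
            min((0, 1), key=lambda k: math.dist(frame, centroids[k]))
            for frame in frames
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k in (0, 1):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[k])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical frames: (low-frequency energy in dB, zero-crossing rate).
frames = [(60, 0.10), (62, 0.12), (58, 0.09), (30, 0.50), (28, 0.55), (33, 0.45)]
labels, centroids = two_means(frames, centroids=[(55, 0.1), (35, 0.5)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

A binary decision tree would apply one such split at each node, passing the frames of each branch to the next node with a different feature set.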

As an example of the syllable detection procedure, this next figure shows spectrograms
and  the  initial  segmentation  boundaries  marked  in  blue  for  the  words  "Massachusetts", 
"yellow",  and  "create".  For  the  first  word,  "Massachusetts",  each  sonorant  region  derived 
by  the  initial  classifier  corresponds  to  a  syllable  unit.  No  further  work  is  necessary. 
However,  "yellow"  and  "create"  have  an  intervocalic  sonorant  and  a  vowel/vowel 
transition, respectively, not detected by the initial sonorant classification. The second stage
of syllable detection examines the contextual information within these sonorant regions
and establishes additional syllable boundaries shown in red.


The  second  component  determines  the  stress  pattern  from  a  set  of  measurements  made 
from  each  syllable.  Measurements  include  duration,  energy  in  different  frequency  bands, 
and  fundamental  frequency,  similar  to  previous  work.  In  addition,  a  spectral  change 
measure  reflects  the  amount  of  spectral  stability  over  time.  Duration  is  measured  by 
including  the  vowel  and  any  surrounding  sonorants  as  determined  by  the  syllable  detection 
component.  An  average  of  energy  and  spectral  change  is  computed  over  each  syllable. 
Finally,  the  maximum  of  fundamental  frequency  over  each  syllable  is  used. 

The  next  phase  of  stress  determination  is  the  decision  algorithm.  First,  the  above 
parameters  are  compensated,  such  as  for  prepausal  lengthening,  and  normalized.  Each 
syllable  is  then  associated  with  a  feature  vector.  The  assignment  of  stress  is  based  on  a 
relative  comparison  of  the  feature  vectors  for  all  the  syllables  in  a  given  word,  and  does  not 
rely on absolute targets for stressed and unstressed syllables obtained through a training
set. As a result, the algorithm provides an implicit timing normalization for differences in
word lengths and number of syllables, and is relatively insensitive to inter- and
intra-speaker variabilities. The "optimum" parameter value across the word for each parameter
forms an extremum in the feature space. A Euclidean distance from each syllable vector to
the  extremum  is  computed.  The  syllable  with  the  minimum  distance  is  then  labeled  as 
stressed.  Reduced  decisions  are  made  by  reexamining  the  duration  and  energy  in  the 
remaining  unstressed  syllables.  Thus,  the  output  of  the  system  is  a  time-aligned  stress 
pattern  for  the  input  word. 
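
The relative comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the three parameters used, the choice of the largest value as the "optimum" for each of them, and the normalization by the within-word maximum are our own simplifications, and the spectral change measure, the prepausal compensation, and the reduced/unstressed decision are left out because their details are not given.

```python
import math

PARAMETERS = ("duration", "energy", "f0_max")  # assumed: larger means more stressed


def pick_stressed(syllables):
    """Return (index of the stressed syllable, distance of each syllable).

    syllables: list of dicts holding a positive value for each parameter.
    Each parameter is divided by its largest value within the word, so the
    per-parameter optimum (the extremum) is 1.0 in every dimension.
    """
    peaks = {p: max(s[p] for s in syllables) for p in PARAMETERS}
    vectors = [[s[p] / peaks[p] for p in PARAMETERS] for s in syllables]
    extremum = [1.0] * len(PARAMETERS)
    distances = [math.dist(v, extremum) for v in vectors]
    return distances.index(min(distances)), distances


# Hypothetical measurements for a two-syllable word.
word = [
    {"duration": 0.25, "energy": 1.0, "f0_max": 180.0},
    {"duration": 0.15, "energy": 0.4, "f0_max": 150.0},
]
stressed_index, distances = pick_stressed(word)
print(stressed_index)  # 0
```

Because every value is compared only with the other syllables of the same word, no speaker-specific or word-length-specific thresholds are needed, which is the property the text attributes to the relative comparison.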

The system is evaluated on a corpus of 1600 isolated words, consisting of 2-, 3-, 4-, and
5-syllable words. There are a total of 4500 syllable tokens in the corpus. In addition, 11
speakers, both male and female, are included in the evaluation.


The  results  of  the  evaluation  can  be  divided  into  three  criteria,  in  increasing  order  of 
difficulty.  The  first  is  the  determination  of  the  stressed  syllable,  even  in  the  presence  of 
syllable  detection  errors.  In  this  case,  2%  of  the  words  do  not  have  the  stressed  syllable 
labeled  correctly.  The  next  criterion  imposes  the  additional  constraint  that  the  number  of 
syllables  must  also  be  correctly  identified.  With  these  more  stringent  requirements,  the 
error  rate  increases  to  10%  of  the  corpus.  Finally,  the  stipulation  that  the  stress  pattern 
must  be  correct  demands  correct  syllable  and  stress  identification,  as  well  as  no  confusion 
between unstressed and reduced segments. These additional constraints increase the error
rate to 13% of the corpus. Comparing these results, we see that most of the errors made by
the system can be attributed to inaccurate syllable detection, rather than stress assignment.
Recall that we showed earlier that the lexical constraint provided by knowing the entire
stress pattern is almost the same as that provided by knowing the number of syllables and
the most stressed one. Thus, the performance deterioration from 10% to 13% may not be too serious.


In  summary,  we  have  determined  that  lexical  stress  information  can  provide  strong 
constraints  towards  word  candidate  reduction.  The  algorithms  that  we  have  developed  can 
automatically and reliably determine the stressed syllables in isolated words. By identifying
the stressed syllables, the system can provide pointers to regions where the acoustic
information is presumably robust, and thus improve the performance of phonetic
recognition. A majority of the errors for stress pattern determination can be attributed to
an ambiguity associated with the number of syllables for certain polysyllabic words, a task
often  difficult  for  humans  as  well.  By  incorporating  syllable  reduction  and  deletion  rules 
and  thus  increasing  the  number  of  alternate  pronunciations,  we  believe  this  system  can 
eventually  be  incorporated  in  a  large  vocabulary  speech  recognition  system. 


Figures used in talk follow this page.


Motivation  I 


Acoustic characteristics of speech sounds are more robust around
stressed syllables.


Evidence from:

- Speech Perception (e.g. Cutler and Foss, 1977)

- Spectrogram Reading (e.g. Klatt and Stevens, 1973)

- Speech Recognition Systems (e.g. Lea, 1980)


Motivation  II 


Stress information provides strong constraints for lexical access (Zue
and Huttenlocher, 1983).
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How  Much  Lexical  Constraint  Does 
Stress  Information  Provide? 


Corpus:  Merriam-Webster  Pocket  Dictionary 

Size: approximately 15,000 (all two- through four-syllable words)


# of Syl.
& Stress
Stressed Syl.
Class Size

Past Studies on Lexical Stress

Acoustic Correlates of Stress:


(e.g. Fry, 1955; Bolinger, 1958; Morton & Jassem, 1965; Lehiste,

Recognition System: (Lieberman, 1960)


Syllable  Detection 


Initial Classification to Establish Sonorant Regions

* Classifier identifies broad phonetic categories
(Leung & Zue, 1984)

* Frame-by-frame classification from speech
parameters

* Binary tree-like structure

* K-means clustering algorithm

Further  Investigation  of  Sonorant  Regions  for  Possible 
Vowel/Vowel  and  Vowel/Sonorant  Transitions 


Stress  Determination 


-  Parameters: 

*  duration 

*  energy  in  different  frequency  bands 


*  fundamental  frequency 

*  spectral  change 

-  Algorithm: 

*  Measurements  are  normalized  and  compensated 

*  Comparisons  are  made  among  syllables  within  a 
word 

*  Euclidean  distance  measured  from  extrema  in  the 
feature  space 
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Evaluation 


Corpus 


#  of  Words  1600 

#  of  Syllables  4500 


#  of  Speakers  11  (6M,  5F) 
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Syllables 
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I’d  like  to  talk  to  you  today  about  some  work  we  have  been  doing  which 
is concerned with the acoustic characteristics of nasal consonants in American
English. There are actually two goals that we are trying to achieve in
this work. First, we are attempting to quantify some of the acoustic characteristics
of the nasal consonants in American English. Although nasal
consonants have been studied extensively in the past by many researchers,
we feel that we can still contribute to this body of knowledge by trying to
quantify some of the basic acoustic properties of nasal consonants from a
reasonably sized database. The second goal of this work is to apply any
robust acoustic characteristics which we are able to establish to the field of
automatic speech recognition.

The acoustic analysis of the nasal consonants was made from a database
specifically created for this work. The database was based on a carefully
constructed corpus of some two hundred words which was designed to contain
nasal consonants in many different contexts. Thus the corpus contained
nasal consonants in prevocalic, medial, and post-vocalic contexts. It contained
nasal consonants in both singleton environments and in consonant
clusters. The corpus also contained minimal pairs of words which could
be used to study the subtle differences between nasalized and non-nasalized
vowels, or between nasal consonants and other sounds which might be
confused with a nasal consonant. The corpus also contained minimal pair
words which could be used to study the differences between nasal consonants
with different place of articulation.

Once the corpus had been designed, the database was created. Three
male and three female speakers each read the words of the corpus, which
were embedded in a carrier phrase. All utterances were digitized at 16
kHz and the phonetic transcriptions were manually time aligned with the
waveforms. The final database contained over 1200 words.

The main point to make about the acoustic analysis of the database is
that apart from the original time alignment of the phonetic transcription,
all measurements and analyses were done automatically by machine. One
advantage to this procedure is that any acoustic characteristic that can
be shown to be robust can be immediately used in an automatic speech
recognition system. Another advantage of automatic analysis is that it
allows the analysis of a large amount of data in a reasonable amount of
time. For instance, the size of the database was limited primarily by the
time it took to complete the time alignment. Automatic analysis must be
done carefully however, or one runs the risk of adding measurement noise
into the distributions of the data.

In our analysis of the nasal consonants we were interested in three major
areas: the nasal murmur or the period of oral closure, the period of
nasalization in an adjacent vowel, and the period of transition between the
nasal murmur and an adjacent vowel. Work has been completed on the first
two areas and we are presently studying the transition region. Due to the
time constraint, I will limit my discussion in this talk to some of the basic
findings of our work with the nasal murmur. In addition, I will discuss
some preliminary results we have obtained with automatic nasal consonant
detection using acoustic information about the nasal murmur only.

The first thing that we studied about the nasal murmur was duration.
We found, as did many previous studies of nasal consonants, that the duration
of the nasal murmur is strongly influenced by phonetic context. Figure
1 contains a statistical summary of the duration of nasal murmurs in
either a singleton environment, or in a cluster with another consonant. For
each group, the mean is indicated by a filled circle, and the vertical bars
represent one standard deviation. The open circles indicate the minimum
and maximum values for all of the samples. As shown in these statistics,
consonant clusters tend to reduce the duration of the nasal murmur. However,
when we look a little closer we find that the actual effect depends
on the voicing characteristics of the adjacent consonant. Thus, the nasal
murmur is shortened when it is in a cluster with a voiceless consonant, and
is lengthened when it is in a cluster with a voiced consonant. This result was
found to be true for both word-initial clusters, as in smack, and word-final
clusters, as in cant.

Although these results are fairly robust, their immediate application
to speech recognition is limited since one would need to know the exact
context to be able to apply this information.

The  second  acoustic  characteristic  which  was  quantified  was  the  energy 
of  the  nasal  murmur.  The  measurement  procedure  is  illustrated  in  figure  2. 
Above  the  spectrogram  of  the  word  hammock  are  the  zero-crossing  rate,  the 
total  energy,  and  the  low-frequency  energy.  The  energy  difference  between 
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the  nasal  and  the  adjacent  vowel  was  calculated  by  subtracting  the  average 
total energy in the nasal murmur from the average total energy in the adjacent
sonorant. In the bottom part of the figure you can find a histogram
of this energy difference, plotted in dB. Since this energy difference is almost
always positive, we conclude that the nasal murmur is consistently
weaker than an adjacent vowel by an average of 10 dB. As a comparison,
we also show the same measurement for liquids and glides. Although the
distributions overlap somewhat, it can be seen that liquids and glides tend
to have less of an energy difference than nasals, i.e., they have more relative
energy than do nasal consonants. Thus, from a speech recognition point
of view, this measurement may help to distinguish nasals from liquids and
glides. In fact, this is one of the measurements that we used in a recognition
experiment that we will describe shortly.
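
The energy difference measurement can be sketched in a few lines. The sketch assumes that the total energy is available as one positive linear value per frame, and that "average total energy" means the mean of the per-frame values after conversion to dB; the talk does not say whether the averaging was done before or after the conversion, and the frame values below are made up.

```python
import math


def mean_energy_db(frame_energies):
    """Mean of the per-frame energies after conversion to dB."""
    return sum(10.0 * math.log10(e) for e in frame_energies) / len(frame_energies)


def energy_difference_db(sonorant_frames, murmur_frames):
    """Adjacent-sonorant energy minus nasal-murmur energy, in dB.

    A positive value means the murmur is weaker than its neighbor.
    """
    return mean_energy_db(sonorant_frames) - mean_energy_db(murmur_frames)


# Hypothetical linear frame energies for a vowel and the following murmur.
vowel = [100.0, 100.0, 100.0]
murmur = [10.0, 10.0]
print(round(energy_difference_db(vowel, murmur), 1))  # 10.0
```

The same function can be applied to a liquid or glide in place of the murmur to reproduce the comparison made in the histogram.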

The next acoustic characteristic which was examined was the spectral
shape of the nasal murmur. For analysis purposes we calculated
statistics based on cepstrally smoothed spectra created from the
pre-emphasized waveform. The spectra were all normalized with respect to
total energy so that we did not have to be concerned with an offset. During
analysis we also restricted ourselves to analyzing the spectra of one
speaker at a time since we found that the spectra, primarily at frequencies
above 1000 Hz, were highly speaker dependent. This may not be surprising,
since the size of the nasal and sinus cavities can vary greatly from speaker
to speaker. Statistics were gathered by collecting multiple spectra from all
of the nasal murmurs. The top of figure 3 shows multiple spectra for m
for one speaker after energy normalization. We see that there are some
common characteristics among the many spectra, and we tried to capture
the essence by averaging the spectra. We found little difference between
the average spectrum obtained from multiple spectra and that obtained from
an average spectrum for each murmur. The two bottom displays illustrate
average spectra for an intervocalic m for one speaker using these two different
techniques. The thick line is the mean spectral shape and the outer two
lines are one standard deviation away. As can be seen, the average spectral
shape is very similar. The standard deviation of the multiple spectra,
shown on the left, is larger, which is to be expected since more spectra were
included. The fact that the two averaging techniques yield similar results
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points  out  the  fact  that  the  spectral  characteristics  of  the  nasal  murmur  are 
quite  stable  against  time,  especially  at  low  frequencies.  For  the  remainder 
of  this  talk,  we  have  used  the  average  spectra  obtained  from  the  multiple 
spectra  technique. 
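
The normalization and averaging steps described above can be sketched as
follows. This is only an illustration under stated assumptions: spectra are
taken to be lists of dB values on a common frequency grid, and the function
names (normalize_energy, average_spectra) are hypothetical, not those of the
original analysis software.

```python
import math
from statistics import mean, pstdev


def normalize_energy(spectrum_db):
    """Shift a dB spectrum so that its total linear energy is 0 dB."""
    total = sum(10 ** (v / 10) for v in spectrum_db)
    offset = 10 * math.log10(total)
    return [v - offset for v in spectrum_db]


def average_spectra(spectra_db):
    """Return the per-bin mean and standard deviation of normalized spectra."""
    normalized = [normalize_energy(s) for s in spectra_db]
    bins = list(zip(*normalized))  # regroup values by frequency bin
    return [mean(b) for b in bins], [pstdev(b) for b in bins]
```

With this normalization, two spectra that differ only by a constant gain
become identical, so the standard deviation reflects differences in spectral
shape rather than in overall level.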

Figure 4 shows an average spectrum for an n for one speaker. On top of
this we have drawn the average spectrum of m for the same speaker and also
the average spectrum of an ng. In general we found that the spectral shapes
of the nasal murmur were highly speaker dependent. Further, although
subtle differences could be detected between the three nasal consonants
for any given speaker, all three nasal consonants tended to have similar
spectral shapes, as we can see here. This observation is in agreement with
that made by Fujimura, who also found little difference among the spectra
of the three nasal murmurs. Finally we found that the spectral shape of
the  nasal  murmur  was  relatively  unaffected  by  phonetic  context. 

In  general  we  found  that  the  nasal  murmur  spectra  were  characterized 
by  a  low  frequency  energy  which  dominated  the  spectrum.  As  illustrated 
in  figure  5,  this  low  frequency  energy  was  nearly  always  centered  between 
200 and 350 Hz. This characteristic is very strong in that if a spectrum
does not have a resonance centered in this frequency range then it is
likely not from a nasal murmur.

As previously mentioned, this low frequency energy dominates the overall
spectrum of the nasal murmur. We found that the normalized resonance
amplitude ranged from 20 to 30 dB for most spectra. Another characteristic
of the nasal murmur was a fairly abrupt decrease of energy in the
frequencies immediately following the low resonance. This drop, which we
have labeled the resonance height, was on average about 10 dB.
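
As a rough illustration of these three measurements (the presence of a low
resonance, its amplitude, and the drop that follows it), consider the sketch
below. It assumes a smoothed, energy-normalized dB spectrum sampled on a
known frequency grid. The peak-picking rule and the name low_resonance are
our own assumptions for illustration, not the procedure used in the study.

```python
def low_resonance(freqs_hz, spectrum_db, lo=200.0, hi=350.0):
    """Return (frequency, amplitude, drop) for the strongest local peak
    whose frequency lies in [lo, hi], or None if there is no such peak."""
    candidates = [
        i for i in range(1, len(spectrum_db) - 1)
        if lo <= freqs_hz[i] <= hi
        and spectrum_db[i - 1] < spectrum_db[i] >= spectrum_db[i + 1]
    ]
    if not candidates:
        return None
    peak = max(candidates, key=lambda i: spectrum_db[i])
    # Walk toward higher frequencies while the spectrum keeps falling;
    # the difference between peak and valley is the "resonance height".
    valley = peak
    while valley + 1 < len(spectrum_db) and spectrum_db[valley + 1] < spectrum_db[valley]:
        valley += 1
    return freqs_hz[peak], spectrum_db[peak], spectrum_db[peak] - spectrum_db[valley]
```

A spectrum for which the function returns None has no resonance centered in
the 200 to 350 Hz range and, by the observation above, is unlikely to come
from a nasal murmur.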

After  observing  some  of  the  basic  characteristics  of  the  nasal  murmur, 
we  were  interested  in  determining  how  well  we  could  discriminate  between 
nasal  consonants  and  impostors  such  as  liquids,  glides,  voice  bars  and  voiced 
fricatives. For the time being we have restricted ourselves to using
information in the nasal murmur alone. Further, we wished to utilize only
measurements  that  are  common  across  different  speakers. 

Our strategy was to combine five simple measures into a maximum
likelihood decision making process. The five measures included the energy
difference, the percentage of the time that there was a low frequency
resonance centered between 200 and 350 Hz in the nasal murmur, the average
amplitude of this resonance in the murmur, and the average height of the
resonance, all of which have been mentioned previously. We also included
a simple measure of the steadiness of the spectrum, which was based on
normalized low frequency energy.

As  a  first  step  at  evaluation,  the  recognition  task  was  performed  using 
the  utterances  of  the  database  for  all  six  speakers.  There  were  520  nasal 
murmurs  and  605  impostor  sounds.  In  the  recognition  of  one  murmur,  all 
other  murmurs  were  utilized  for  training.  The  results  indicate  that  the 
system  can  detect  the  presence  of  a  nasal  consonant  94%  of  the  time  and 
can detect an impostor 81% of the time (see footnote 1). The next step will be to evaluate
the  system  on  a  different  database  containing  a  large  number  of  different 
speakers. 

In summary, we have found some distinct characteristics of nasal
murmurs and we are encouraged by the preliminary results of the recognition
task. We feel that our study supports the notion that a better
understanding of the acoustic properties of speech sounds will help us to perform better
speech  recognition. 
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1  The  authors  ask  to  be  consulted  before  these  results  are  quoted  since  they  were  obtained 
at such a preliminary stage in the evaluation.
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ABSTRACT 

This study is concerned with the quantification of acoustic measures that
characterize nasalized vowels. It is motivated by the fact that the detection of
nasalized vowels is useful in speech recognition, since regions can be identified
in the speech signal where a nasal consonant may be present, and where the
vocal tract resonances are distorted. Our study consists of several steps. First,
an acoustic study was performed using utterances from a large database in
order to propose potential measures of nasality. Next, automatic algorithms
were developed to extract these measures, and their utility was established
through examination of a large amount of data. Finally, recognition
experiments were performed using these measures. The system detected
nasalized vowels with an accuracy of approximately 74% when tested on one
speaker at a time and trained on the speech of the remaining speakers in the
database.


INTRODUCTION 

This study is concerned with the acoustic analysis of nasalized vowels
and the subsequent development of algorithms for their recognition. In
American English, vowels adjacent to nasal consonants are often
nasalized. The presence of a side-branch for sound transmission
introduces additional poles and zeros into the vocal tract transfer
function. As a consequence, the short-time spectra of nasalized vowels
often exhibit extra nasal formants, or a broadening of the vowel
formants, typically in the first formant region.

There are several reasons why the detection of nasalization in vowels
would be useful for automatic speech recognition. First, vowel
nasalization provides important information regarding the presence of a
nasal consonant, especially in contexts where the nasal murmur has
been shortened considerably or is absent altogether, as in words like
smack or can't. In these cases, the most reliable cue for the presence of
a nasal consonant is often the degree of nasalization of adjacent
vowels. Second, nasalized vowels present a problem for formant
tracking algorithms, since correct assignment of the spectral peaks to
the vocal tract formant frequencies becomes more difficult due to the
pole-zero interactions. If the regions of nasalization were known,
different tracking strategies could be employed.

The purpose of our study is two-fold. First, we would like to
establish the acoustic properties that distinguish nasalized vowels from
oral vowels. We accomplish this by drawing on knowledge gained
from past studies to propose and evaluate measures of nasality using a
specially designed database. Second, the measurements found to be
reliable are incorporated into a recognition algorithm for
speaker-independent nasalized vowel detection.

Research supported by the Natural Sciences and Engineering Research Council of
Canada and by the Office of Naval Research under Contract N00014-82-K-0727.


There is a vast amount of literature spanning the last twenty-five
years dealing with the analysis, synthesis, and perception of nasalized
vowels. Extensive acoustic studies have shown that nasalized vowels are
characterized by the presence of a low-frequency formant and
antiformant, as well as additional weaker spectral peaks in the spectral
valleys between oral formants [1, 4]. The low-frequency pole/zero pair
often broadens or splits the spectral peak associated with the first
formant. Synthesis experiments based on these findings have shown
that the perceptually salient cue for nasalization appears to be the
reduction of the relative prominence of the first formant peak [5].
These results, while important in leading to a better understanding of
the acoustic characteristics of nasalized vowels, are often not directly
applicable to speech recognition. Some of the data has not been
presented in sufficiently quantitative form, and the measurements often
rely on human intervention and interpretation. In many cases, the data
has been gathered from restricted environments such as stressed
consonant-vowel-consonant (CVC) syllables.

DATABASE DESCRIPTION

Our study utilizes a database originally collected for a separate study
of the properties of nasal consonants in American English [2]. The
corpus, containing over two hundred words, is designed to include
nasal consonants in many different contexts. It contains nasal
consonants, both in singletons and in clusters, that appear in
syllable-initial, medial, and final positions. Many of the words also form
minimal pairs such as cap/camp and sack/snack. Words in the corpus
were embedded in a carrier phrase and recorded by six speakers, three
male and three female, resulting in over 1,700 word tokens. All of the
recorded words were digitized at 16 kHz, excised from the carrier
phrase, and their phonetic transcriptions aligned with the speech
waveform using the Spire system [7].

ANALYSIS  ISSUES 

An acoustic study of nasalized vowels in American English is
complicated by several issues. First, speakers are free to nasalize a
vowel in any phonetic context, since nasalized vowels are not
distinguished phonemically from oral vowels in American English. As
an illustration, consider the spectrograms shown in Figure 1. The left
and middle panels contain the words mack and back, respectively,
spoken by speaker A, whereas the right panel contains the word mack,
spoken by speaker B. Careful listening to the vowel indicates that
speaker A has nasalized the /ae/ in the word back. This is evident in
the spectrogram by the presence of a low-frequency spectral peak
below the first formant. In fact, many speakers frequently nasalize their
vowels, irrespective of context. To be of use for nasal consonant
detection, the acoustic study must establish measures which can
automatically differentiate between vowels in a nasal consonant context
and those not in a nasal consonant context.


Figure 1: Spectrograms of the words mack, back, and mack


Samples such as those shown in Figure 1 suggest that it is not
possible to distinguish nasalized and oral vowels by phonetic context
alone. In order to accurately determine the presence of nasalization,
the acoustic signal must be augmented with other physiological
measurements indicating the degree of nasal coupling. Without such
additional measurements, one can only infer vowel nasalization by
perceptual experiments, or by arbitrarily determined criteria. In this
study, nasalized vowels are defined as those vowels adjacent to a nasal
consonant, while non-nasalized vowels are those vowels that are not
adjacent to a nasal consonant. Thus, our acoustic study can be viewed
as an analysis of relative nasalization. The underlying assumption is
that when vowels are next to a nasal consonant, they are nasalized more
than they would be otherwise. This definition is directed more toward
nasal consonant detection than toward general nasalization detection, since
there are clearly some vowels which are not in a nasal consonant
context which are nasalized. However, the results of the acoustic study
are applicable to both areas.

Our definition for nasalized vowels is admittedly inadequate, since
according to such a definition, vowels which are separated from a nasal
consonant by an intervening consonant (as in film) will be classified as
non-nasalized vowels, whereas in all probability these vowels will be
nasalized. Since the nature of these vowels is somewhat ambiguous, we
have chosen to eliminate such tokens from our database in order to
reduce the amount of noise they might cause in the measurement
distributions. This excluded about 200 vowels from the acoustic
analysis.

A second difficulty with a study of nasalized vowels is that different
speakers nasalize to various degrees. Thus, one person's nasalized
vowel could have the same characteristics as another's non-nasalized
vowel. Again, referring to Figure 1, we see that the strong cues for
nasalization in speaker A are barely present for speaker B. This
phenomenon smears measurement distributions, and compounds the
difficulties associated with speaker-independent nasalized vowel
detection.

In our acoustic analysis, several procedures were adopted to control
the influence of interspeaker variability. The analysis is conducted on a
speaker-by-speaker basis in order to eliminate the speaker-dependent
nature of vowel nasalization. In fact, the initial portion of the analysis is
restricted to observing relative differences between minimal word pairs
such as skip/skimp for each speaker, thus maximizing the opportunity
for us to observe acoustic contrasts between nasalized and
non-nasalized vowels.

A third factor complicating the acoustic analysis is the inherent
dynamic quality of nasalized vowel spectra. Sometimes the spectral
change is due to the time course and the varying degree of nasal
coupling throughout the vowel production. Other times the dynamic
characteristics can be attributed to the movements of other articulators,
such as during the production of diphthongs. In either of these cases, the
net effect is that the acoustic characteristics change with time.

While it may be possible to track significant characteristics of the
vowel (such as the resonance frequencies) as a function of time, this
method was not used because it was felt that such systems may be
fragile, especially in nasalized vowels. Instead, each vowel was divided
into subsegments, so that averaging procedures could be used in each
subsegment to reduce measurement noise. At the same time, however,
changes between the different subsegments of the vowel, caused by
increasing nasalization, would still be measurable. After some
experimentation it was decided to use three subsegments in each vowel.
Thus, whenever a measurement of some parameter was made on a
vowel, there were three values returned. Each value represented an
average of the parameter in one of the three equally spaced vowel
subsegments. An added benefit of such a procedure is that, by
comparing the measurements in the initial and final portions, one may
be able to determine whether the vowel is preceded or followed by a
nasal consonant.
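
The subsegment averaging can be illustrated with a short sketch. It assumes
that a measurement is available as one value per analysis frame within the
vowel. The function name subsegment_averages and the rounding used to place
the boundaries are hypothetical choices made for this example.

```python
from statistics import mean


def subsegment_averages(frame_values, n_segments=3):
    """Average a per-frame measurement over n equal-length subsegments."""
    n = len(frame_values)
    if n < n_segments:
        raise ValueError("need at least one frame per subsegment")
    # Boundaries of the equally spaced subsegments, in frame indices.
    bounds = [round(k * n / n_segments) for k in range(n_segments + 1)]
    return [mean(frame_values[a:b]) for a, b in zip(bounds, bounds[1:])]
```

Comparing the first and last of the three returned values is one simple way
to ask whether a measure rises or falls across the vowel, which relates to
whether the nasal consonant precedes or follows it.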


ACOUSTIC  STUDY 

In the acoustic analysis, the goal was to establish differences between
nasalized and non-nasalized vowels by comparing some form of average
spectra of selected tokens from the database. On the basis of these
observations, general discriminating properties could be proposed and
quantified using all utterances of the database. The following sections
describe the steps followed for the spectral analysis. A more detailed
description of the acoustic analysis may be found in [3].

Spectral Averaging

For analysis, the speech signal was pre-emphasized, and spectra were
computed from a windowed cepstrum [6]. The spectra were all
normalized with respect to total energy so that individual energy offsets
were eliminated. Statistics were gathered by averaging multiple spectra
for nasalized and non-nasalized vowels of each speaker. Figure 2 shows
average spectra for an /ae/ for a male speaker. The left panel presents
a statistical summary of the normalized, smoothed spectra of the
non-nasalized /ae/ of a male speaker. The right panel presents the
corresponding nasalized /ae/ of the same speaker. The average spectral
shape, shown by the dark line, is surrounded by lines which represent
one standard deviation from the mean.


Figure 2: Average Spectral Shape of /ae/


Although  the  averaging  procedure  can  potentially  smear  useful 
information,  we  nevertheless  found  these  displays  very  informative.  By 
comparing  the  vowel  spectra  such  as  those  in  Figure  2,  we  were  able  to 
establish  general  characteristics  of  nasalized  vowels.  For  example,  we 
found that the most noticeable difference between the nasalized and
non-nasalized vowels was in the low frequency regions of the
magnitude spectrum. Typically, non-nasalized vowels had one
resonance in the first formant region, while nasalized vowels had two.
However, many non-nasalized vowels also had an extra resonance in
the low frequency region (see Note 1). Thus, it is not always possible to
distinguish nasalized from non-nasalized vowels by simply measuring how often
there is an extra resonance in the first formant region of the vowel.

Another  characteristic  of  nasalized  vowels  is  that  the  extra  resonance 
is  noticeably  more  distinct,  and  is  manifested  in  the  spectrum  in  at  least 
two  ways.  First,  the  magnitude  of  the  extra  resonance  may  increase 
relative  to  the  first  resonance  of  the  vowel.  This  can  be  caused  by  the 
first  resonance  decreasing  in  amplitude,  or  the  extra  resonance 
increasing, or both. Second, the valley between the extra resonance and
the first resonance may deepen. Thus, even if a non-nasalized and a
nasalized vowel both happened to have an extra resonance, it may still
be possible to discriminate between them by measuring the strength of
the  extra  resonance  relative  to  the  first  formant. 

Another  observed  characteristic  of  nasality  was  a  smearing  of  the  first 
resonance  itself.  In  fact,  when  an  extra  resonance  was  not  present,  as 
occasionally observed in a nasalized vowel, a measure of the spread of
energy about the first resonance was found to be the best indication of
nasalization. 

By observing the characteristics of these averaged spectra on a subset
of the database, we were able to propose measures that may signify the
acoustic contrasts between the two classes of vowels. Due to the
variability of the environment, none of the observed acoustic
characteristics was present in a nasalized vowel at all times. However,
taken in combination, we felt that these properties could help
discriminate nasalized vowels from non-nasalized ones.

The next step of our acoustic study was to formalize our observations
into a set of specific algorithms for automatic feature extraction, and to
validate and quantify these measures by examining their statistical
behavior on the entire database. Figure 3 illustrates a typical result of
the analysis, which compares a measure of the spectral spread of
nasalized vowels versus non-nasalized vowels. The spread was
calculated by computing the standard deviation of a spectral first
moment, computed below 1000 Hz [3]. In the display, the dark lines
are the distributions of nasalized vowels (695 samples), while the
dashed lines are the distributions of non-nasalized vowels (500
samples). All values are in Hz. We can see that for this measure, the
two classes of vowels have different, although overlapping, distributions.
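
One plausible reading of the spread measure is sketched below. This is an
interpretation, not the algorithm documented in [3]: the magnitude spectrum
below the cutoff is treated as a distribution over frequency, its first
moment gives a center of mass, and the spread is the standard deviation about
that center. Weighting by linear magnitude is an assumption, and the function
name is hypothetical.

```python
import math


def center_of_mass_and_spread(freqs_hz, magnitudes, cutoff_hz=1000.0):
    """Return (center of mass, standard deviation), both in Hz, of the
    part of the spectrum that lies below cutoff_hz."""
    pairs = [(f, m) for f, m in zip(freqs_hz, magnitudes) if f < cutoff_hz]
    total = sum(m for _, m in pairs)
    if total <= 0:
        raise ValueError("spectrum has no energy below the cutoff")
    center = sum(f * m for f, m in pairs) / total
    variance = sum(m * (f - center) ** 2 for f, m in pairs) / total
    return center, math.sqrt(variance)
```

Under this reading, a single sharp first-formant peak gives a small spread,
while an extra low resonance or a smeared first resonance gives a larger one.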


Figure 3: Histogram of Standard Deviation


RECOGNITION EXPERIMENT

After establishing some of the important acoustic characteristics of
nasalized vowels, preliminary investigations were conducted to evaluate
the potential use of these properties in speech recognition. Admittedly,
these experiments do not realistically reflect the utility of the
measurements, since the evaluations were made on the same database
as the acoustic study. However, they do provide an indication of their
potential for use in speaker-independent speech recognition systems.


The Task

We have structured the recognition experiments as a set of
discrimination tests. Thus in a typical experiment, the nasalized vowel
detection system is given a test token and training data. The system
must then classify the token as either next to a nasal consonant
(nasalized), or not next to a nasal consonant (non-nasalized).
Throughout these experiments, it is assumed that the boundaries of the
vowels are known, although no knowledge of the presence or absence
of a nasal murmur is used.

The Strategy

Our acoustic study produced several parameters, each potentially
useful in characterizing a certain aspect of nasalized vowels. We have
chosen to incorporate all these measurements into detection systems for
the task at hand. Thus a given test token is associated with a set of n
values, corresponding to a set of acoustic measurements made on the
test token. If we consider the set of values as a vector in an
n-dimensional space, we are faced with a multidimensional
decision-making problem.

Although there are several possible decision making procedures
available, a sum of individual log likelihoods was found to be simple
and effective (see Note 2). Using this technique, each parameter returns the log
ratio of the likelihood that the token is nasalized to the likelihood that
the token is non-nasalized. Likelihoods are established by using
normalized histograms of the measurements based on the training data
provided to the systems. Bin widths of the histograms were manually
set to ensure that the distributions would be reasonably shaped.
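
A minimal sketch of such a decision rule follows. The histogram likelihoods
and the sum of log ratios come from the description above. The probability
floor for empty bins, the decision threshold of zero, and all function names
are assumptions introduced for illustration only.

```python
import math
from collections import Counter


def histogram(values, bin_width):
    """Normalized histogram: bin index -> fraction of the training values."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {b: c / len(values) for b, c in counts.items()}


def log_likelihood_ratio(value, nasal_hist, oral_hist, bin_width, floor=1e-3):
    """Log ratio of nasalized to non-nasalized likelihood for one measure."""
    b = math.floor(value / bin_width)
    # The floor keeps the logarithm finite for bins unseen in training.
    p_nasal = max(nasal_hist.get(b, 0.0), floor)
    p_oral = max(oral_hist.get(b, 0.0), floor)
    return math.log(p_nasal / p_oral)


def classify(token, models):
    """token: one value per measure; models: one (nasal_hist, oral_hist,
    bin_width) tuple per measure, in the same order."""
    score = sum(log_likelihood_ratio(v, *m) for v, m in zip(token, models))
    return "nasalized" if score > 0 else "non-nasalized"
```

Summing the log ratios treats the measures as if they were statistically
independent, which is what makes the rule simple to train from
one-dimensional histograms.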

The  Experiment 

As  an  initial  evaluation,  the  detection  system  was  tested  with  the 
utterances  of  the  database.  There  were  a  total  of  685  nasalized  vowels 
and  500  non-nasalized  vowels. 

In order to approximate a speaker-independent task given the limited
amount of available data, the system was evaluated using a rotational
procedure. In each step, the system was trained on the data from five of the
six speakers in the database, and was tested on the data from the sixth
speaker.
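
The rotational procedure amounts to leave-one-speaker-out evaluation, and
can be sketched as below. The data layout and the train and classify
callables are hypothetical placeholders. Any training and classification
routines with these shapes could be substituted.

```python
def rotate_speakers(tokens_by_speaker, train, classify):
    """Leave-one-speaker-out evaluation.

    tokens_by_speaker maps a speaker name to a list of (features, label)
    pairs. train(pairs) returns a model, and classify(model, features)
    returns a label. The result maps each held-out speaker to the
    fraction of that speaker's tokens classified correctly.
    """
    rates = {}
    for held_out, test_tokens in tokens_by_speaker.items():
        # Train on every speaker except the one being tested.
        training = [
            pair
            for speaker, pairs in tokens_by_speaker.items()
            if speaker != held_out
            for pair in pairs
        ]
        model = train(training)
        correct = sum(classify(model, x) == y for x, y in test_tokens)
        rates[held_out] = correct / len(test_tokens)
    return rates
```

Because the held-out speaker never contributes to training, each rate is an
estimate of performance on an unseen speaker, subject to the small number of
speakers available.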

Six  measures  from  the  acoustic  study  were  incorporated  into  the 
detection  system.  Each  measure  was  taken  from  one  of  the  three  vowel 
subregions and was usually a maximum or minimum of the three
values.  The  six  measures  were: 

1.  Center  of  Mass.  The  average  center  of  mass  in  the  middle 
subregion. 

2.  Standard  Deviation.  The  maximum  value  of  the  average 
standard  deviation. 

3. Maximum Resonance Percentage. The maximum
percentage of the time there is an extra resonance in the low
frequency region.

4.  Minimum  Resonance  Percentage.  The  minimum  percentage 
of  the  time  there  is  an  extra  resonance  in  the  low  frequency 
region. 

5. Maximum Resonance Dip. The maximum value of the
average dip between the first resonance and the extra
resonance.

6. Minimum Resonance Difference. The minimum value of the
average difference between the first resonance and the extra
resonance.


Using the circular evaluation procedure described earlier, we
performed the recognition experiments on the speech of all of the
speakers in our database. The results of our experiments are
summarized in Table 1. While the results vary as a function of the
recognition experiments, there are several trends that are worth noting.
Across all speakers, an average nasalization detection rate of 74% was
obtained. In all but one of the experiments summarized in Table 1, the
system is better at detecting nasalized vowels than non-nasalized ones.
This is perhaps a reflection of the emphasis that we have placed on the
discovery of acoustic features that characterize nasalized vowels, as
opposed to oral vowels. Another striking result is that the system
performed significantly better for males than for females. The system
recognized high vowels better than low vowels, a consequence of the
fact that the detection rate for low vowels spoken by females is quite
poor.


                        Detection Rate (%)
Evaluation       Nasalized   Non-Nasalized   Average

All                  91            97           74
Male                 93            78           91
Female               66            60           83
High                 91            75           79
Low                  75            63           89
Male High            92            75           79
Male Low             99            93           85
Female High          74            71           73
Female Low           59            87           61

Table 1: Nasalized Vowel Detection Rates


While 74% correct is better than chance, it still leaves a large number
of vowels for which no confident statement may be made about
whether they are nasalized or not. This is primarily due to the fact that
different speakers nasalize to varying degrees, and that the acoustic
characteristics of nasalized vowels depend somewhat on the
characteristics of the vowel. Thus, by attempting to operate in a
speaker-independent mode, the individual distributions are being
smeared (see Note 3).


The findings of our acoustic study, as well as those of other researchers,
have shown that the extra resonance tends to be more distinct in low
vowels than in high vowels, and speakers tend to nasalize low vowels
more readily than high vowels. Thus, one would expect to be able to
detect nasalization more successfully in low vowels, as confirmed by our
recognition experiment for male speakers.

The poor detection scores for female speech are perhaps also
understandable, since there is often an extra low resonance in the
sonorant regions. Given that the extra resonance is a major acoustic
difference between nasalized and non-nasalized vowels, it is natural to
expect system performance to deteriorate when the low resonance is
present in the speech signal irrespective of nasality. It is interesting to
note that for female speakers, the system was able to identify
nasalization in high vowels better than in low vowels. Since the low
resonance of female speakers is always below the first formant, high
vowels which have a nasal resonance above the first formant are
uniquely nasal, and so may be identified correctly.

Finally, it should be noted that our detection algorithms made no use
of information regarding the presence of an adjacent nasal murmur. In
a separate study, we have found that nasal murmurs can be detected
with high reliability [2]. It is very likely that recognition results can be
improved substantially when this further source of knowledge is
incorporated. While the results are only moderately successful, we were
nevertheless comforted by the fact that human listeners do not perceive
nasality too well when presented only with an isolated vowel. In a
perceptual experiment that we conducted using a subset of the
database, it was found that human performance is about the same as
that of our recognition system on the same data.

The system was also clearly hampered by our definition of a nasalized
vowel since, as has been pointed out, vowels may be nasalized in any
context. If the task of the system were to detect truly nasalized vowels
rather than vowels adjacent to a nasal consonant, the performance
would probably improve.

SUMMARY

The detection of nasalization in vowels is both an important and a
difficult problem in phonetic recognition. It is important because of
the potential benefits that one can derive from its successful
detection, as discussed earlier. For the purposes of assisting the
detection of nasal consonants, it is difficult because nasalized vowels are
not distinguished phonemically from oral vowels in American English.
Thus speakers are free to nasalize vowels to various degrees, and may
nasalize a vowel in any phonetic context.

In this study, we have established a set of acoustic measures that
characterize different aspects of the average spectra of nasalized vowels.
Algorithms for extracting these measures automatically from the
acoustic signal have been developed, and a set of recognition
experiments was performed using these measurements. Our results
suggest that vowel nasalization can be detected with moderate success,
although the recognition experiments should be validated by using a
large amount of new data from unknown speakers. In addition,
information regarding the presence of an adjacent nasal murmur may
also prove to be helpful.

NOTES

1. The vowels produced by female speakers can have a low resonance in any
context. This property is due to breathiness more than nasalization.

2. Gaussian decision making techniques and binary tree classifiers were also
examined.

3. Some earlier speaker-dependent experiments obtained detection rates
over 10% better than those reported here.
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The  Effect  of  Speech  Rate  on  the  Application  of 
Low-Level  Phonological  Rules  in  American  English 


Introduction 

This paper is concerned with the effect of speech rate on the acoustic
characteristics of speech sounds. The primary goals of the study are to improve our
understanding of the changes in the continuous speech signal as a function of speech rate,
thereby leading to models that will account for these changes. Our hope is also that
a fundamental understanding of this sort will benefit efforts in developing speech
synthesis and recognition systems that must deal with variability in speech rate. We
should note at the outset that the effects of speech rate, from articulatory, acoustic,
and perceptual standpoints, have been studied by many other researchers. Most of
these studies, however, either have dealt with some of the more global properties,
such as the frequency of pause insertion, or have not been concerned with natural
continuous speech. Our focus is somewhat different. Specifically, we are interested
in investigating changes in the relative frequency of application of certain low-level
phonological rules, such as flapping and palatalization. Furthermore, we are
interested in quantifying changes in some of the segmental cues as a function of speech
rate. Our analysis of the data is not complete. This paper should be viewed as a
progress report.

Data  Collection  and  Analysis 

The  corpus  that  we  used  consists  of  a  short  paragraph  containing  47  words  in 
4  sentences.  It  is  especially  designed  such  that  many  low-level  phonological  rules 
may  be  applied  at  most  of  the  word  boundaries.  In  fact,  a  speaker  has  the  option 
of  applying  one  or  more  rules  at  35  out  of  the  possible  43  word  boundaries.  The 
rules include palatalization, glottalization before vowels, gemination, and alveolar
flapping.  Some  examples  of  these  rules  are  shown  below. 


Palatalization            Could you ...
Gemination                Advertisements seem ...
Flapping                  What ever ...
Schwa Devoicing           ... to be ...
Glottal Stop Insertion    ... such ...


The  paragraph  was  recorded  by  four  speakers,  two  male  and  two  female,  at  three 
different  rates:  fast,  normal,  and  slow.  Since  it  is  difficult  to  control  the  absolute 
speech rate from speaker to speaker, we decided to solicit from all speakers fast and
slow  readings  relative  to  their  normal  speech  rate.  The  recording  procedures  were  as 
follows.  First,  the  speakers  were  asked  to  read  the  paragraph  several  times  at  their 
normal  rate.  The  average  duration  of  the  reading  was  measured  with  a  digital  clock. 
For  the  slow  reading,  the  clock  was  set  to  twice  the  time  of  the  normal  reading,  and 
the  speakers  were  asked  to  complete  the  reading  in  the  allotted  time.  For  the  fast 
reading,  we  had  originally  hoped  to  gather  data  at  twice  the  normal  rate.  However, 
it  became  clear  early  on  that  people  have  extreme  difficulty  speaking  at  twice  their 
usual  rate  without  significantly  affecting  intelligibility.  As  a  result,  we  modified  our 
procedure  and  asked  the  speakers  to  complete  the  reading  in  three-fourths  the  time 
of  their  normal  reading.  A  minimum  of  two  readings  for  each  rate  was  recorded. 
All  in  all,  our  database  contains  48  paragraphs,  a  total  of  slightly  over  2,200  word 
tokens. 

Figure  1  summarizes  the  actual  speech  rate  measured  in  terms  of  the  number  of 
syllables  per  second  of  speech,  for  each  experimental  condition  and  for  each  of  the 
speakers.  We  see  that  for  both  the  fast  and  the  slow  speaking  conditions,  speakers 
can  indeed  produce  speech  at  the  desired  rate.  The  average  number  of  syllables 
per second is 2.4, 4.8, and 6.0, for the slow, normal, and fast reading conditions,
respectively. 
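
The timing targets used in the recording procedure can be checked with a little arithmetic. The sketch below is illustrative only: the function names are ours and the 14-second normal reading time is a made-up value. Only the scaling factors (twice and three-fourths the normal reading time) and the reported average normal rate of 4.8 syllables per second come from the text.

```python
def target_durations(normal_seconds):
    """Return the clock settings for the slow, normal, and fast readings."""
    return {
        "slow": 2.0 * normal_seconds,    # twice the normal reading time
        "normal": normal_seconds,
        "fast": 0.75 * normal_seconds,   # three-fourths the normal reading time
    }


def expected_rates(normal_rate):
    """Syllable rates implied by the targets if only the total time changed."""
    return {
        "slow": normal_rate / 2.0,
        "normal": normal_rate,
        "fast": normal_rate / 0.75,
    }


# Hypothetical normal reading time of 14 s (not a value from the study).
targets = target_durations(14.0)

# Reported average normal rate: 4.8 syllables per second.
rates = expected_rates(4.8)
```

Starting from the normal rate of 4.8 syllables per second, the targets imply about 2.4 and 6.4 syllables per second for the slow and fast readings, which is close to the measured averages of 2.4 and 6.0.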
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Figure 1: Average Speaking Rate for the Three Conditions


All the recorded utterances were digitized at 16 kHz using the Spire facility
developed at MIT, and digital spectrograms for the utterances were made. Phonetic
transcriptions for the utterances were obtained independently by the two
experimenters. This was done both by listening to the utterances and by visually
examining the spectrograms and other relevant displays. The experimenters discussed the
differences, which were often minor, and a consensus was reached. This transcription
was then time-aligned with the speech waveform by hand. Most of the data
was subsequently analyzed using SpireX, a data analysis and statistics gathering
program on our Lisp Machine workstations.


Word  Boundary  Effects 

As explained earlier, we designed the short paragraph such that low-level
phonological rules can be applied at many of the word boundaries. The analysis of our
data  indicates  that  there  is  a  general  tendency  for  the  frequency  of  rule  application 
to  be  correlated  with  speech  rate. 

Figure 2 gives some examples. In the left-hand column, we compare the
spectrogram of a portion of the phrase “advertisements seem” spoken by a male speaker
at the slow and fast rates. In the top panel, the two /s/ sounds are geminated,
whereas in the bottom panel, a long pause is inserted between the two /s/’s. In the
right-hand column, we compare fast and slow readings of the phrase “what ever.”
In this case, the word-final /t/ is turned into a flap for the fast condition, whereas a
pause and a glottal stop are inserted following a weak release for the slow condition.


[Panel labels: “Advertisements seem ...” and “What ever ...”]

Figure 2: Spectrograms, Slow vs. Fast Rate


Our  findings  for  the  frequency  of  application  of  low-level  phonological  rules  are 
summarized  in  Figure  3. 
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Figure  3:  Frequency  of  Rule  Occurrence 

The  top  graph  describes  those  rules  that  occur  more  frequently  as  the  speech 
rate  increases.  We  see  that  for  palatalization  and  gemination,  there  is  a  dramatic 
increase  in  the  frequency  of  rule  application  from  slow  to  fast.  In  fact,  these  two  rules 
are  often  applied  as  frequently  in  normal  speech  as  they  are  in  fast  speech.  Flapping 
and  schwa  devoicing,  on  the  other  hand,  do  not  occur  nearly  as  frequently  as  the 
other  two.  For  these  two  rules,  the  application  at  the  fast  rate  varies  from  speaker 
to  speaker.  Schwa  devoicing,  for  example,  does  not  occur  in  significant  number  until 
the  fast  condition,  and  as  such  is  dominated  by  two  of  the  four  speakers. 

The bottom graph shows those rules whose frequency of occurrence is negatively
correlated  with  the  speech  rate.  At  the  slow  rate,  three  of  the  speakers  inserted 
pauses  at  almost  all  the  word  boundaries,  and  the  speech  is  read  as  if  the  sentences 
were  strings  of  isolated  words  concatenated  together.  When  pauses  are  inserted 
before  a  word  that  starts  with  a  vowel,  a  glottal  stop  is  often  observed.  Again,  the 
frequency  of  glottal  stop  insertion  is  about  the  same  for  normal  and  fast  speech. 


Segmental  Effects 

We  now  turn  our  attention  to  the  second  issue,  namely  the  segmental  changes 
due  to  changes  in  speech  rate.  For  this  presentation,  we  will  limit  our  discussion 
to  the  durational  changes  for  various  speech  sounds.  In  general,  the  segmental 
durations  increase  as  the  speech  rate  decreases. 

This  is  illustrated  in  the  next  figure,  which  compares  the  vowel  durations  for  the 
three  experimental  conditions.  We  see  that  the  histogram  for  the  fast  rate  is  similar 
to  that  of  the  normal  rate.  The  decrease  in  vowel  duration  is  small,  presumably 
due to the incompressibility of segment durations as suggested by Klatt and
others.  Focusing  now  on  the  slow  condition,  we  see  that  vowels  can  be  lengthened 
significantly  when  speech  rate  is  reduced. 
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Figure  4:  Histogram  of  Vowel  Duration  for  the  Three  Conditions 

While  the  general  trends  suggested  in  this  figure  hold  true  for  all  the  speech 
sounds, the amount of change varies from sound to sound. Figure 5 summarizes for
several  classes  of  speech  sounds  the  percentage  change  in  the  average  duration,  as 
compared  to  the  normal  condition,  for  the  fast  and  slow  conditions. 

As  can  be  seen  from  this  figure,  the  amount  of  durational  increase  for  the  slow 
condition is greater for vowels and nasals, and smaller for stops and fricatives. This
is  presumably  due  to  the  fact  that  the  airflow  is  greater  for  the  turbulent  sounds, 
making  them  harder  to  sustain.  In  contrast,  the  decrease  in  duration  for  the  fast 
condition  is  around  14%. 


Figure 5: Percentage Change in Segmental Durations
(Relative  to  the  Normal  Rate) 

The amount of change in segmental duration is significantly less than what
the actual speech rate would predict. This is due to the fact that the predominance
of the overall durational changes can be accounted for by pauses. The fast condition
contains one-third fewer pauses than the normal condition. Since the pauses are
considerably longer than speech sounds, a decrease in the proportion of pauses
further increases the overall speech rate. For the slow condition, there are over three
times as many pauses inserted between words. As a result, the overall speech rate
is further reduced. While the number of pauses increases significantly from the fast
to slow conditions, this figure shows that the average duration of a pause remains
relatively unchanged for all three conditions.
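
The role of pauses can be made concrete with a simple additive model, in which the total reading time is the sum of the segment durations plus the number of pauses times a fixed pause duration. The following is only a sketch under stated assumptions: the segment time, pause count, and pause duration for the normal condition are hypothetical values chosen for illustration. Only the relative changes (segments about 14% shorter and one-third fewer pauses in the fast condition, with pause duration unchanged) are taken from the text.

```python
def reading_time(segment_seconds, n_pauses, pause_seconds):
    """Total reading time under a simple additive model."""
    return segment_seconds + n_pauses * pause_seconds


def rate_increase(normal_time, other_time):
    """Relative increase in speech rate (same syllable count, less time)."""
    return normal_time / other_time - 1.0


# Hypothetical normal condition: 10 s of speech sounds, 9 pauses of 0.4 s each.
SEGMENTS, PAUSES, PAUSE_LEN = 10.0, 9, 0.4
normal = reading_time(SEGMENTS, PAUSES, PAUSE_LEN)

# Fast condition, segments only: durations shrink by about 14%.
fast_segments_only = reading_time(0.86 * SEGMENTS, PAUSES, PAUSE_LEN)

# Fast condition, segments and pauses: one-third fewer pauses of the same length.
fast_full = reading_time(0.86 * SEGMENTS, PAUSES * 2 // 3, PAUSE_LEN)

gain_segments_only = rate_increase(normal, fast_segments_only)
gain_full = rate_increase(normal, fast_full)
```

Under these assumed numbers, shortening the segments alone raises the rate by roughly 11%, while also removing a third of the pauses raises it by roughly 24%. This illustrates why the segmental changes are smaller than the overall change in speech rate would suggest.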


In  summary,  our  analysis  indicates  that  for  most  low-level  phonological  rules, 
their  relative  frequency  of  application  increases  when  the  speech  rate  is  increased 
from the speaker’s normal rate. In contrast, speakers often adopt different strategies
for  slowing  down  their  speech,  including  the  insertion  of  pauses  and  the  release  of 
word-final  stops,  such  that  the  frequency  of  application  of  the  rules  varies  from 
speaker  to  speaker. 

When the speech rate is faster than normal, our results indicate that the segmental
durations are decreased almost uniformly. This is accompanied by a reduction
in the number of pauses in the sentences. When the speech rate is slower than
normal, the increase in segmental duration appears to vary from sound to sound. Our
result is in agreement with previous work by Goldman-Eisler, Huggins, Grosjean,
and others, that in slow speech there is a sharp increase in the number of pauses
in a sentence, with each word taking on the appearance of an isolated word. While
the number of pauses varies as a function of speech rate, the average duration of a
pause remains unchanged.
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Introduction 

In many areas of speech research, ranging from speech analysis
and synthesis to recognition, researchers are often faced with a
common set of analysis procedures. Specifically, there is often the
need to:

•  Record and digitize utterances,

•  Create  various  attributes  of  the  speech  signal,  and 


properties of the attributes, as well as how they are computed. As
a  result,  attributes  can  be  displayed  conveniently,  and  they  can 
also  be  used  to  compute  new  attributes. 

Spire displays are organized in the form of layouts. A layout
is a collection of displays that the user can compose freely to suit
his/her research needs. Layouts that are used frequently can be
predefined and saved for future usage. For example, the recording
layout and the transcription layout are provided by Spire, since
they are almost always needed by a user, and serve as examples
for beginners. Many of the commands in Spire are executed by


•  Display and perform interactive measurements on these various
attributes.

The ease with which one can perform these tasks greatly
facilitates the gathering of information and the corresponding
improvement of our speech knowledge. Spire (Speech and Phonetics
Interactive Research Environment) represents our attempt to
provide the answer to such research needs at MIT.

Spire was originally designed and implemented by David W.
Shipman [Shipman, 1982]. Since 1983, however, the system has
undergone considerable modification by the second author of this
paper, DSC. It is an evolving system that is still being actively
improved. This paper serves as a progress report on the present
status of Spire.


Hardware  Requirements 

The speech workstation that we have developed centers around
a Lisp Machine, originally developed at the M.I.T. Artificial
Intelligence Laboratory. It is specifically designed for the efficient
execution of Lisp, a symbolic programming language widely used
in the artificial intelligence community. The current version of
Spire runs on a Symbolics 3600 or 3670 Lisp Machine. The Lisp
Machine has 2 Mbytes of main memory and a 1 Gbyte address
space, a 474 Mbyte disk and a Spin Unibus interface. The Lisp
Machine’s high-resolution graphics console and hand-held pointer
allow development of extremely convenient user interfaces.

We have augmented the standard configuration of the Lisp
Machine with an FPS-100E array processor (up to 4 Mflops), which
handles essentially all the computationally intensive numeric
processes. The work station also includes a DSC-200/240 A/D and
D/A converter and audio amplifier, a Versatec V-80 electrostatic
printer/plotter, and assorted audio equipment such as a
microphone, a set of headphones, and a tape recorder. The Lisp
Machine work stations are connected to one another and to central
file servers via a packet-switched local area network.

System  Description 

Spire organizes an utterance as a collection of attributes. The
attributes may be symbolic (e.g. phonetic transcription), or they
may be numeric (e.g. RMS amplitude). Some of the attributes are
one dimensional (e.g. speech waveform), while others are
multidimensional (e.g. short-time spectra). Spire has knowledge of the


means of the hand-held mouse pointer. The mouse can be used
to configure a layout, to play back a section of the utterance,
edit waveforms, examine data values, alter display options, and
perform other functions.

For the remainder of this section, we will give some examples of
the operation and capabilities of Spire. For a detailed description
of Spire, see Shipman [1982] and Cyphers [1985].

Recording 

Figure 1 shows the recording layout of Spire. The default sampling
rate is either 16 kHz or 2^ kHz, although it can be as high as
70 kHz. Appropriate anti-aliasing filters can be selected by the
user. Information about the talker, sampling rate, filename, and
the orthographic transcription can all be changed easily with a
click of the mouse. Alternatively, an agenda file can be set up
to sequentially change these parameters. This latter option is
particularly useful for bulk data input when a list of the utterances


Figure 1: The Spire recording layout




already exists on-line.

Currently, Spire can accept an essentially unlimited amount of
speech at a stretch. An automatic end-point detector attempts to
locate the utterance. The user can listen to the located utterance,
modify the endpoints if necessary, and accept the utterance into
the database, all with several clicks of a mouse button.
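
The text does not describe how the end-point detector works. As a rough illustration of the general idea only, the sketch below locates an utterance by thresholding short-time energy. The frame length (160 samples, i.e. 10 ms at 16 kHz), the threshold, and the function names are all assumptions made for this example and should not be read as Spire's actual algorithm.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def find_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the region above threshold,
    or None if no frame exceeds it. The end index is exclusive."""
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Synthetic example: silence, a burst of "speech", then silence again.
signal = [0.0] * 480 + [0.5, -0.5] * 400 + [0.0] * 320
endpoints = find_endpoints(signal)
```

A practical detector would typically add smoothing and minimum-duration rules; in the workflow described above, the user can in any case correct the endpoints by hand.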


Transcription 

Figure 2 shows the transcription layout of Spire. This layout is
used to enter the phonetic transcription, and time-align it (or the
orthographic transcription) with the speech waveform. For
phonetic alignment, the region of the waveform bounded by the cursor
(shown as the solid vertical line in the spectrogram and waveform
displays) and the marker (shown as the dotted line in the same
displays) can be associated with a phonetic symbol by a click of
a mouse button. Using this layout, an experienced acoustic
phonetician can align a two second utterance in about 5 minutes.

While manual time-alignment using Spire is quite efficient, it
nevertheless requires the expertise of a small group of experts. As
a result, the amount of data that can be collected and aligned
is greatly limited. In addition, phonetic alignment is often
subjective, thus leading to inconsistencies among transcribers. The
tedious nature of the task also tends to introduce human errors.
We have recently developed a semi-automatic system, extending
the basic capabilities of Spire, to perform the time alignment. The
results of our preliminary evaluation have been encouraging. For
a description of the alignment system, see Leung and Zue [1984]
and Leung [1985].

Other  Features 

One of the most important features of Spire is that a user can
compose his/her own layouts for the specific research needs.
Figure 3 shows an example of such a layout. The figure displays the
wideband spectrogram of the utterance, the zero crossing rate,
the original waveform, the orthographic as well as phonetic
transcriptions, and the short-time, narrow band and LPC spectra.
This layout illustrates some of the interactive features of Spire.
First, display parameters can be changed by the user at will. Thus,
for example, the zero-crossing rate is displayed on the same time
scale as the wideband spectrogram. Second, all displays are
time-synchronized. Moving the cursor in one display will cause the
other displays to change accordingly. Third, displays can be
overlaid, and the overlay parameters can be changed as well. For
example, the narrowband spectra are overlaid with the LPC
spectra, and the LPC spectra are displayed with a different thickness
for distinction.

Spire also has the capability of generating high-quality digital
spectrograms.  The  output  device  has  a  resolution  of  200  points 
per  inch.  Figure  4  gives  an  example  of  a  digital  spectrogram. 

Spire was designed with two general goals in mind. First, it
provides an extremely interactive environment and a basic set of
capabilities such that speech scientists, with little or no
programming experience, are able to collect and analyze speech data.
Second, it provides a framework for users to conveniently develop

and by the System Development Foundation. The Spire system
was originally designed and implemented by David W. Shipman.
David Kmfman has also contributed to various phases of its
development.



Figure 4: A digital spectrogram produced with Spire
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turns within Spire, i.e. to add new attributes, to suit their own
research needs. Currently the default version of Spire contains
approximately 40 attributes of the speech signal, whereas a
customized version can compute any number of attributes. Some of
the customized Spire versions in our research group have as many
as three hundred attributes.

Summary

The development of the Spire system was a 5 man-year effort,
spread over a period of three years. Our goal is to create a research
environment that is easy to use, and thus increase the amount of
data that a speech scientist can examine, leading to an increase in
our speech knowledge. It has played an important role in
advancing our understanding of the acoustic properties of speech sounds.

While Spire is still being actively improved, we are eager to
share our development results with other researchers who may find
such a system useful. In fact, the system configuration has been
duplicated, and the software acquired, by many research
laboratories and universities outside of MIT. Those who are interested
should contact the MIT Patent Office directly for licensing
procedures.
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ABSTRACT 

This paper deals with the analysis and recognition of nasal
consonants, /m, n, ŋ/, in American English. The acoustic analysis was
performed on a database of over 1,200 words excised from a carrier phrase
and spoken by six speakers, three male and three female. Across all
speakers, we found that nasal murmurs are characterized primarily by a
lower spectral amplitude than the adjacent vowels, and by the presence
of a dominant, low-frequency peak in the short-time spectra. Automatic
procedures were then devised to reliably extract acoustic attributes that
reflect these characteristics. Finally, recognition experiments were
performed to test the validity of these attributes. Our results, based on
600 sentences spoken by 60 speakers, show that nasal consonants can
be distinguished from the imposters with an accuracy of 83.6%.

INTRODUCTION 

This paper deals with the analysis and recognition of nasal
consonants, /m, n, ŋ/, in American English. Nasal consonants are produced
with a closure in the oral cavity. By lowering the velum, airflow is
directed through the nasal cavity and eventually radiated from the nostrils.
The transfer function for nasal production contains poles as well as zeros,
the latter a consequence of the fact that the vocal cavity serves as a side
branch. Thus the power spectrum of nasal consonants shows spectral
prominences as well as spectral notches [3].

In American English, nasal consonants appear quite frequently. (They
have a combined frequency of occurrence of about 11% [9].) Nasals can
be singly attached to a vowel, or they can form a cluster with other
consonants. Nasal consonants are difficult to recognize for several
reasons. First, the characteristics during oral closure, often referred to as
the nasal murmur, differ significantly from speaker to speaker because of
differences in the size and shape of the nasal and sinus cavities. Second,
a nasal murmur can also be affected drastically by the phonetic
environment. In some cases, the nasal murmur is almost entirely absent (as
in “camp”) [5]. In this case the presence of the nasal consonant lies
almost entirely in the degree of nasalization in the adjacent vowel. Finally,
the complex production mechanism makes them difficult to characterize
acoustically.

The goal of our research is twofold: (1) We are interested in
discovering speaker-independent acoustic cues for nasal murmurs, and (2)
We would like to test the effectiveness of these cues in actual recognition
experiments where the nasal consonants are to be distinguished from
imposters. While the acoustic characteristics of nasal murmurs have been
studied extensively in the past [2], [3], these results are not directly
applicable to speech recognition. In some cases, the acoustic analyses were
not based on a sufficient amount of data. Most of these studies were
primarily concerned with isolated words. In addition, the acoustic features
do not always lend themselves to automatic extraction and subsequent
computer recognition.

We must emphasize that analysis of the nasal murmur will only
provide partial information regarding the presence of nasal consonants. An
integral part of our study deals with the analysis of vowel nasalization
[4], which will not be dealt with in this paper.

DATABASE  DESCRIPTION 

The acoustic analysis of nasal consonants was made from a database
specifically created for this study. The corpus was based on a set of
some two hundred carefully chosen words that contain nasal consonants
in many different phonetic contexts. For example, the nasal
consonant may appear prevocalically (as in “mitt”), medially (as in
“simmer”), postvocalically (as in “dim”), and in clusters (as in “snow” and
“think”). In addition, some words containing imposters, i.e. other
consonants that are acoustically similar to the nasal consonants (as in
“demise/devise”), are also included. Many of the words form minimal
pairs (as in “sin/sing/sick/sink/sinking”) so that the effect of the
phonetic context can be isolated and subtle acoustic differences identified.

Once the corpus had been designed, the database was created. Three
male and three female speakers each read the words of the corpus, which
were embedded in a carrier phrase. This resulted in a database of slightly
over 1,200 words. All utterances were digitized at 16 kHz and the
phonetic transcriptions were manually time-aligned with the waveforms.
Temporal measurements were made from the time-aligned
transcription, while spectral measurements were made from a cepstrally smoothed
short-time spectrum.

ACOUSTIC  ANALYSIS 

Apart  from  the  manual  alignment  of  the  phonetic  transcription  with  the 
speech  waveform  during  data  preparation,  all  other  measurements  and 
analyses  were  performed  without  human  intervention.  This  automatic 
procedure  permits  immediate  implementation  for  nasal  recognition  once 
an  acoustic  attribute  has  been  shown  to  be  robust.  Another  advantage 
of  automatic  analysis  is  that  a  large  amount  of  data  can  be  analyzed  in 
a  reasonable  amount  of  time.  Thus  the  size  of  the  database  was  limited 
primarily  by  the  time  it  took  to  complete  the  time  alignment.  Automatic 
analysis must be done carefully, however, or one runs the risk of adding
measurement  noise  into  the  distributions  of  the  data. 

Figure 1 shows the spectrograms of the words “simmer”, “sinner”,
and “singer”, spoken by a male speaker. These spectrograms serve to
illustrate some of the qualitative features common to all nasal
consonants. We see that the nasal murmur typically has lower energy than
the adjacent vowels. It is also delineated from the vowels by a sharp
spectral discontinuity. In addition, it is characterized by the presence of
a low frequency spectral peak. There are, however, other sounds with
acoustic characteristics similar to those of nasal murmurs. Some of these
sounds, a voice-bar (referring to the closure portion of voiced stops) and
a prevocalic /l/, are shown in Figure 2.

The  acoustic  analysis  is  focused  on  quantifying  the  features  shown 
in  the  spectrograms.  We  started  by  measuring  the  duration  of  nasal 


overlap in the distributions, it is clear that, on the average, nasal
consonants have a greater decrease in energy than liquids and glides, and
have a smaller difference than voice bars. Thus, from a speech
recognition point of view, this measurement may help to distinguish nasals
from liquids and glides.


Figure 3: Statistical Summary of the Average Energy for All Nasals (N),
Non-Medial Nasals (NM), Medial Nasals (M), Liquids and Glides (G),
and Voice Bars (VB).


murmurs. In agreement with previous studies [8], we found that the
duration of the nasal murmur is strongly influenced by phonetic context.
For example, the nasal murmur is shortened when it is in a cluster with
a voiceless consonant, and is lengthened when it is in a cluster with a
voiced consonant. This result was found to be true for both word-initial
clusters, as in “smack”, and word-final clusters, as in “can't”. Although
these differences are fairly robust, our ability to utilize such information
in speech recognition must depend on knowledge of their exact context
and the speaking rate.


We next investigated the energy in the nasal murmur relative to the
adjacent vowels. This energy difference is determined by subtracting the
average total energy in the nasal murmur from the average total energy
in the adjacent vowels. For the tokens in the database, this energy
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Figure 2: Spectrogram of the words “hammock”, “cab” and “lip”, spoken
by a male speaker.


difference was almost always positive, implying that the nasal murmur
is consistently weaker than an adjacent sonorant.
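
The energy-difference measurement described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are ours, the decibel scale is an assumption (the text does not state the units), and the two segments are synthetic.

```python
import math


def average_energy_db(samples):
    """Average energy of a segment in dB (10 * log10 of the mean square)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def energy_difference_db(vowel_samples, murmur_samples):
    """Vowel energy minus murmur energy; positive when the murmur is weaker."""
    return average_energy_db(vowel_samples) - average_energy_db(murmur_samples)


# Synthetic segments: the "murmur" has half the amplitude of the "vowel".
vowel = [1.0, -1.0] * 100
murmur = [0.5, -0.5] * 100
difference = energy_difference_db(vowel, murmur)
```

Halving the amplitude lowers the energy by about 6 dB, so the difference here is positive, matching the sign convention in the text.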


As illustrated in Figure 3, nasal consonants in a medial position have a
slightly smaller energy difference than nasal consonants in other contexts.
This is probably due to the fact that medial nasals have a sustained level
of energy when surrounded by vowels. In contrast, energy during nasal
murmurs tends to rise gradually in prevocalic positions, and taper off in
postvocalic positions, resulting in a lower average value. This observation
is reinforced by studies of the nasal murmur stability, which indicate that
the energy of medial nasals is quite steady.

Figure 3 also compares the value of the energy difference of the nasal
consonants to similar sounds such as liquids and glides, and voice bars,


(The average value is indicated by a filled circle. The vertical lines
indicate one standard deviation and the open circles display maximum
and minimum values. The number of samples in each context is indicated
below the display.)


Spectral analysis of the nasal consonants was based on a cepstrally
smoothed spectrum created from the pre-emphasized waveform. The
spectra were all normalized with respect to total energy to eliminate
offsets. During analysis we also restricted ourselves to analyzing the spectra
of one speaker at a time since we found that the spectra, primarily at
frequencies above 1000 Hz, were highly speaker dependent. This dependency
is probably due to the fact that the size of the nasal and sinus
cavities can vary greatly from speaker to speaker.

Statistics were gathered by collecting multiple spectra, computed every
10 ms, from all of the nasal murmurs. Although the averaging procedure
can potentially smear useful information, it serves the purpose of
revealing general trends across all phonetic contexts. Figure 4 shows
examples of the average spectra for the three nasal consonants for a male
speaker. Although subtle differences could be detected among the three
nasal consonants for any given speaker, such differences are overshadowed
by their similarities. This observation has been made previously by
Fujimura, who also found little difference among the spectra of the three
nasal murmurs [3]. Furthermore, we found that the spectral shape of the
nasal murmur was relatively unaffected by phonetic context.

A more quantitative analysis verified these general observations.
Specifically, we found that the nasal murmur spectra were characterized by a
low frequency resonance which dominated the spectrum. This low
frequency spectral peak was nearly always centered between 200 and 350
Hz. The amplitude of this low frequency resonance is quite stable, and
is higher for nasals than for semivowels, as shown in Figure 5. Another
characteristic of the nasal murmur was an abrupt decrease of energy
in the frequencies immediately following the low frequency resonance.
Again, this attribute can be effective in distinguishing nasal murmurs
from semivowels.

RECOGNITION  EXPERIMENTS 

After establishing some of the acoustic properties of nasal consonants and
developing attributes that capture these properties, preliminary
investigations were conducted to evaluate the usefulness of these attributes in
speech recognition. In order to be consistent with the way the acoustic
analyses were carried out, we have structured the recognition experiments
as a set of discrimination tests. Specifically, the nasal detection system
is given a test token and a set of training data. The system must then
classify the token as either a nasal murmur, or an impostor sound, such


Figure 4: Average Spectrum of the Nasal Consonants /n/ (Top), /m/
(Bottom Left), and /ŋ/ (Bottom Right), for a male speaker. (The
average spectrum, shown by the dark line, is surrounded by lines which
represent one standard deviation from the mean.)


as a liquid, glide, voice bar or voiced fricative. Throughout these
experiments, it is assumed that the boundaries of the murmur are known,
and that there is some knowledge of the broad phonetic context. For the
experiments described in this paper, no information in adjacent sounds
is utilized.

The  Strategy 

Our acoustic study produced several attributes, each potentially
useful in characterizing a certain aspect of the nasal murmur. We have
chosen to incorporate the five most robust measurements into detection


Figure 5: Histograms of the Amplitude of the Low Frequency Resonance
Obtained  from  the  Average  Spectra  for  Nasals  (Solid  Line,  520  Tokens) 
and  Semivowels  (Dashed  Line,  357  Tokens). 


systems for the task at hand. The five measures are:

• Energy: The difference in average energy between the consonant
and the adjacent vowel,

• Percentage: The percentage of the time that there was a low frequency
resonance centered between 200 and 350 Hz in the consonant,

• Strength: The average amplitude of this resonance in the consonant,

• Drop: The average energy drop from the low frequency resonance
to the frequency region immediately above, and

• Stability: The change in low frequency energy throughout the
consonant.

Thus, a given token is associated with a set of five values that
correspond to the acoustic measurements made on the test token. If we
consider the set of values as a vector in a five dimensional space, we are
faced with a multi-dimensional decision making problem.

After examining several possible classification procedures [4], we
settled on the use of a binary tree classification technique. Decision tree
classifiers have been used in a variety of pattern recognition problems
and have a number of important advantages over single stage
classification techniques, particularly in problems of high dimensionality [1].
Through the use of the decision tree, a complex global decision is made
through a series of simpler, local decisions at each level of the tree. This
approach is well suited to situations in which the decision surface is complex,
or in which the number of classes to be identified is large. More importantly,
the decision tree is a structure which is easy to interpret and can often
provide insight into a particular problem [7]. In addition, knowledge
obtained through constructing the tree using the learning sample can be
augmented by human supervision and guidance during tree growth.
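To make the idea of a series of simple local decisions concrete, the following
is a minimal sketch of growing and applying a binary decision tree over
five-valued attribute vectors. The splitting criterion (Gini impurity), the
depth limit, and all of the toy attribute values are assumptions made for
illustration; the text does not specify how the trees were actually grown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed splitting criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (score, attribute_index, threshold) of the best split, or None."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def grow(rows, labels, depth=0, max_depth=3):
    """A leaf is a class label; an internal node is (attribute, threshold, low, high)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, f, threshold = split
    low = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    high = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    return (f, threshold,
            grow([r for r, _ in low], [y for _, y in low], depth + 1, max_depth),
            grow([r for r, _ in high], [y for _, y in high], depth + 1, max_depth))

def classify(tree, row):
    """Follow one local decision per level until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, threshold, low, high = tree
        tree = low if row[f] <= threshold else high
    return tree

# Toy vectors: (Energy, Percentage, Strength, Drop, Stability); values are invented.
rows = [(12.0, 0.90, 0.8, 20.0, 0.1), (10.0, 0.80, 0.7, 18.0, 0.2),
        (11.0, 0.95, 0.9, 22.0, 0.1), (4.0, 0.30, 0.4, 6.0, 0.6),
        (5.0, 0.50, 0.3, 8.0, 0.5), (3.0, 0.20, 0.2, 5.0, 0.7)]
labels = ["nasal", "nasal", "nasal", "impostor", "impostor", "impostor"]
tree = grow(rows, labels)
decision = classify(tree, (11.5, 0.9, 0.8, 19.0, 0.15))
```

Because each internal node tests a single attribute against a threshold, the
resulting tree can be read directly, which is the interpretability advantage
noted above.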

Experiment  One 

As  a  first  step  in  validating  the  usefulness  of  the  acoustic  attributes, 
the  recognition  task  was  performed  using  the  utterances  of  the  original 
database.  520  nasal  murmurs  and  695  impostor  sounds  were  used  in  this 
experiment.  In  order  to  approximate  a  speaker-independent  task  given 
the  limited  amount  of  available  data,  the  system  was  evaluated  using  a 
rotational  procedure.  In  each  step,  the  system  was  trained  on  data  from 
five  of  the  six  speakers  in  the  database,  and  was  tested  on  the  data  from 
the  sixth  speaker. 
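The rotational procedure can be sketched as a generator that holds out each
speaker (or group of speakers) in turn. The function name, the dictionary
layout, and the speaker identifiers are hypothetical; only the train-on-the-rest,
test-on-the-held-out rotation comes from the text.

```python
def rotate_speakers(tokens_by_speaker, held_out_size=1):
    """Yield (test_speakers, train_tokens, test_tokens) for each rotation step."""
    speakers = sorted(tokens_by_speaker)
    for start in range(0, len(speakers), held_out_size):
        test_speakers = speakers[start:start + held_out_size]
        train = [t for s in speakers if s not in test_speakers
                 for t in tokens_by_speaker[s]]
        test = [t for s in test_speakers for t in tokens_by_speaker[s]]
        yield test_speakers, train, test

# Six hypothetical speakers: each step trains on five and tests on the sixth.
database = {"s1": ["tok1", "tok2"], "s2": ["tok3"], "s3": ["tok4"],
            "s4": ["tok5"], "s5": ["tok6"], "s6": ["tok7", "tok8"]}
for test_speakers, train, test in rotate_speakers(database):
    pass  # train a classifier on `train`, score it on `test`
```

Averaging the per-step scores then gives a single identification rate that
approximates speaker-independent performance.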

Earlier investigation revealed that the usefulness of the attributes
depends on knowledge of the broad phonetic context. As a result, the
data were first divided into three categories based on the broad phonetic
context: prevocalic, post-vocalic, and intervocalic. Apart from this
preliminary structure, decision trees were grown automatically using the
training data. The results of this experiment are shown in Table 1. We
see that, for this database, the system produced an average identification
rate of 83.6%.

Table  1:  Detection  Confusions  on  Small  Database 
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Experiment  Two 

Although  the  results  of  experiment  one  were  encouraging,  there  is  a 
possibility  that  these  results  were  a  reflection  of  the  fact  that  the  same 
database  was  used  for  system  development  and  system  evaluation.  As 
a  result,  a  second  database  was  collected  in  order  to  provide  a  more 
realistic  evaluation  of  the  recognition  systems. 

The new database contained ten sentences recorded from thirty male
and thirty female speakers. In addition to the phonetic environments
investigated previously, this second database also contains nasal
consonants interacting with vowels and consonants across word boundaries.
All in all, this second database provided over 1100 nasals and 2000
impostor tokens from 60 speakers, in a continuous speech environment.
Once again, the system was evaluated in a rotational procedure, this
time training on data from fifty of the speakers, and testing on the data
of the remaining ten speakers.

Using the broader context produced an average identification rate of
83.5%, which is essentially identical to the results of Experiment 1. A
breakdown of the result may be found in Table 2.


DISCUSSION 


The results of the two recognition experiments demonstrate that the
automatically extracted acoustic attributes are useful in distinguishing


Table 2: Detection Confusions on Large Database

                   Output (%)
Input          Nasal    Impostor
Nasal            79        21
Impostor         15        85

nasal murmurs from impostors across a large number of male and female
speakers. We are encouraged by these results for several reasons. First,
closer examination of the experimental results reveals that the
recognition performance varies little from speaker to speaker, suggesting that
the acoustic attributes are capturing speaker-independent cues. Second,
the binary decision trees for different training samples appear to be very
similar, i.e., the same attributes are often used at the same node in
different trees. The stability of the decision trees is indicative of the fact
that the acoustic attributes, as well as the way they are being utilized,
are quite robust. Finally, one must keep in mind that we have based
the recognition of nasal consonants solely on information contained in
the nasal murmurs. In many phonetic environments, the nasal murmur
is both weak in amplitude and short in duration. As a result, the nasal
murmur may not always provide the clearest information regarding the
presence of a nasal consonant. By incorporating knowledge of the degree
of nasalization in adjacent vowels, better recognition performance can be
expected.

By comparing Tables 1 and 2, we see that impostors are identified
with similar accuracy, whereas nasals are identified less accurately in the
second, larger database. We believe this difference may be due to the fact
that the speech data is acoustically more variable in the second database.
(Recall that the first database consisted of words excised from a carrier
phrase, whereas the second database contains continuous sentences.) In
addition, the second experiment utilized proportionally more impostors
than nasals. Since the binary tree classifier inherently incorporates a
priori frequency of occurrence information into the tree structure, it is
expected to perform better for the more likely candidates.

Our preliminary examination of the decision tree suggested that
different attributes may be effective in different phonetic contexts. The
results of the recognition experiments indicate that this is indeed the
case. The average recognition scores are 85.8%, 80.3%, and 83.5% for the
prevocalic, medial, and postvocalic contexts, respectively. In fact, the
decision tree looks quite different for the three contexts. For example,
Stability was found to be the most useful attribute in the medial context,
but not very reliable in the postvocalic context. This is presumably due
to the fact that the low frequency energy is higher and more steady in
the medial context, as discussed earlier. Note that we have chosen to
partition the data in terms of broad phonetic context. This approach
stems from our belief that such contexts can be established effectively in
practical recognition tasks.

Examination of the recognition results indicates that there is a slight
difference in performance between male and female speakers. This could
be due to differences in vocal tract sizes. Our evaluation procedure,
which always tests a group of ten all-male or all-female speakers, may have
further enhanced this contrast. By incorporating an equal proportion of
male and female speakers into the training and test data, slightly better
recognition results may be observed.

SUMMARY 

In summary, our acoustic analyses resulted in the discovery of some
distinct characteristics of nasal murmurs, and we are encouraged by the
preliminary results of using these acoustic attributes for nasal
recognition. We feel that our study supports the notion that a better
understanding of the acoustic properties of speech sounds will lead to improved
performance in phonetic recognition.

Future work in this direction includes the characterization and
recognition of nasalized vowels, and the utilization of acoustic information in


both the vowel and murmur portions to identify the oral consonants.
We also plan to investigate procedures that automatically identify the
boundaries between vowels and oral closure.

(This research was supported by the Natural Sciences and Engineering
Council of Canada and by the Office of Naval Research under contract
N00014-83-K-0737.)
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ABSTRACT 


Motivation 


This paper is concerned with the design and implementation of a
system for automatic alignment of phonetic transcriptions with continuous
speech. The implemented system consists of three modules. The speech
signal is first segmented into broad classes using a non-parametric
pattern classifier. Path finding techniques are then used to align the broad
classes with the phonetic transcriptions. These aligned broad classes
provide “islands of reliability” for more detailed segmentation and
refinement of boundaries. Specific speech knowledge is utilized throughout the
system. By doing alignment at the phonetic level, the system can often
tolerate inter- and intra-speaker variability. The system was evaluated
on seven hundred sentences, spoken by male and female speakers. 97%
of the segments are mapped into only one phonetic event, and approximately
80% of the time the offset between the boundary found by the automatic
alignment system and by a trained transcriber is less than 10 ms.
Supporting software has also been developed so that final manual adjustments,
if needed, can be made.

INTRODUCTION 

Problem  Statement 

This paper describes a system that automatically aligns a phonetic
transcription with the associated speech waveform. In developing such a
system, we assumed that the speech signal has an underlying
representation that consists of a sequence of phonetic segments. We recognize
that during speech production, the acoustic realization of these phonetic
segments may blend from one segment to another, due to the interaction
among the various articulatory structures and their different degrees of
sluggishness. The task of the system is then to associate a phonetic label
with regions delineated by significant acoustic landmarks. We do not in
any way imply that the alignment system finds the boundaries between
phonetic segments.

The reliability of the acoustic landmarks in continuous speech is not
at all uniform. Some landmarks are obvious and clear while others are
more subtle. Figure 1 illustrates the spectrogram and various displays
for the phrase, “Glue the sheet to the dark..”, spoken by a male speaker.
Row (a) of the Figure shows the phonetic transcription which is
manually aligned by an experienced acoustic phonetician. As can be seen from
the spectrogram, the transition from a strong fricative to a vowel, as in
the word “sheet”, is strongly evidenced by the abrupt decrease of high
frequency energy and a sharp onset of low frequency energy. This kind
of acoustic landmark is relatively easy to detect. On the other hand,
the transition between a vowel and a sonorant, as in the word “dark”, is
marked by more gradual acoustic changes. This second acoustic
landmark is often quite subtle and is, in general, difficult to locate without
first establishing the phonetic context. Therefore, the difficulty of the
time alignment task will vary from one type of transition to another.


Phonetic alignment is essential to many areas of speech research,
since the time-aligned transcription can serve as pointers to specific
phonetic events in the waveform. If a sufficient amount of time-aligned
acoustic data is available, speech researchers will then be able to quantify the
properties of phonetic segments and describe how their characteristics
are modified by contexts. These results in turn will lead to a better
model for speech production, as well as better rules for speech synthesis
and recognition. A large database of aligned speech material is
particularly important for phonetic recognition, since it can be used for phonetic
knowledge acquisition, rule development, as well as system training and
evaluation.

The  automatic  phonetic  alignment  system  can  also  serve  as  a  testbed 
for  the  development  of  specific  phonetic  recognition  algorithms.  It  is  well 
known  that  detailed  phonetic  recognition  is  extremely  difficult,  due  to 
the  context  dependency  of  the  acoustic  realizations.  In  the  automatic 
alignment  task,  we  attempt  to  locate  specific  phonetic  events  when  the 
identity  and  the  contexts  are  known.  Thus  it  can  be  viewed  as  a  learning 
step  towards  phonetic  recognition. 

Traditionally, the alignment is done manually by a trained
acoustic phonetician, who listens to the speech signal and visually examines
various displays of the signal. There are several disadvantages to this
approach. First, the task is extremely time consuming; even under the
best of circumstances, the process of time alignment can take several
minutes for one second of speech material. Second, the task requires
the skill and knowledge possessed by a small number of experts. These
two reasons combine to severely limit the amount of data that can be
collected in this manner. Third, manual labeling often involves decisions
that are highly subjective. Therefore, there is a lack of consistency and
reproducibility of the results. Even if the sentence and the transcription
were the same, the inter- and intra-transcriber variability can still be
quite high. Finally, there is the problem of human error associated with
tedious tasks.

Review  of  Literature 

Over the past few years, several automatic time alignment procedures
have been suggested in the literature. Most of these approaches attempt
to align the speech waveform with a reference waveform, using dynamic
programming algorithms. The reference waveform may be a known and
previously labeled utterance [1], [3], a concatenation of stored templates
[8], or a synthetically generated utterance [6]. In order for these methods
to be effective, the two waveforms must not differ significantly in detailed
phonetic structures, or the synthesis rules must be fairly advanced. A
second approach, which also uses dynamic programming, is to segment
and label the waveform into broad phonetic classes prior to time
alignment [11]. A more detailed frame-by-frame labeling is then achieved by
a second dynamic programming algorithm, using derivatives of energy
and formant functions.
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Figure  1:  An  example  for  the  phrase,  ‘Glue  the  sheet  to  the  dark..' 


This  paper  describes  a  new  method  of  automatic  phonetic  alignment. 
This  method  utilizes  a  standard  pattern  classification  algorithm,  a  path 
finding  algorithm,  and  the  constraints  imposed  by  our  acoustic-phonetic 
knowledge.  The  speech  signal  is  first  segmented  into  broad  phonetic 
classes  using  a  non-parametric  pattern  classifier.  The  resulting  string 
is  then  aligned  with  the  transcription  using  branch  and  bound  search. 
Acoustic-phonetic knowledge is utilized extensively in the feature
extraction for pattern classification, the specification of constraints for
time-aligned paths, and the subsequent segmentation/labeling and refinement
of boundaries.

SYSTEM  DESCRIPTION 


Initial  Broad  Classification 

In our opinion, direct alignment of the speech waveform with the
detailed phonetic transcription is difficult, due to the high degree of
acoustic variability in the speech signal. Our approach is to make an
initial broad classification, relying on statistical pattern classification
techniques. The objective is to determine robust acoustic-phonetic events
that are relatively context-independent, and to use these events as
anchor points for more detailed analysis. We have chosen to structure the
classifier as a sequence of binary classifiers arranged in a binary decision
tree. One advantage of using a set of classifiers is that a different feature
vector can be used for each classifier in order to maximize the contrasts
between the two possible output classes. For example, zero-crossing rate
is helpful for distinguishing sonorants and obstruents, but not for
distinguishing vowels from other voiced consonants. Thus the problem of
classifying the speech signal into different classes can be reduced to a
sequence of sub-problems, which are relatively easier to tackle.
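As an illustration of one such feature, zero-crossing rate can be computed per
frame as sketched below. The function name and the two synthetic frames are
hypothetical; the text names the feature but does not give its exact
definition.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# A slowly varying frame (sonorant-like) versus a rapidly alternating one
# (a crude stand-in for frication noise); both are synthetic.
sonorant_like = [math.sin(2 * math.pi * n / 40) for n in range(80)]
obstruent_like = [1.0, -1.0] * 40

low_rate = zero_crossing_rate(sonorant_like)
high_rate = zero_crossing_rate(obstruent_like)
```

A large gap between the two rates is what makes the feature useful at a
sonorant/obstruent node, while two voiced sounds would both give low rates.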

At each node in the decision tree, a binary decision is made by a
pattern classification machine as shown in Figure 3. The structure of each
of the classifiers is identical; the only differences are the feature vectors and
initial seed points used in the clustering algorithm. Each classifier starts
with a set of M parameters selected based on acoustic-phonetic
knowledge. (The number of parameters used, M, may be different for each
of the binary classifiers.) The parameters are then processed through
a seven-point median smoother, clipped, and normalized. Clipping is
intended to emphasize the portions of the speech signal where
boundaries are likely to occur. Clipping thresholds are determined statistically
from a set of training data, such that segment boundaries fall within the
transitional regions. Normalization then transforms each of the clipped
feature parameters to the same scale. Together the clipping and
normalization procedures effectively assign different weights to different feature
parameters depending on how much the feature parameter distributions
of the two classes overlap.
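The three conditioning steps can be sketched for a single parameter track as
follows. The edge handling of the smoother, the mapping onto the range 0 to 1,
and the threshold values in the example are assumptions; in the system
described, the clipping thresholds are derived statistically from training
data.

```python
from statistics import median

def median_smooth(track, width=7):
    """Seven-point running median; windows are truncated at the track edges."""
    half = width // 2
    return [median(track[max(0, i - half): i + half + 1])
            for i in range(len(track))]

def clip(track, low, high):
    """Limit every value to the interval [low, high]."""
    return [min(max(x, low), high) for x in track]

def normalize(track, low, high):
    """Map the clipped range [low, high] onto [0, 1]."""
    return [(x - low) / (high - low) for x in track]

def condition(track, low, high):
    """Smooth, clip, then normalize one feature parameter track."""
    return normalize(clip(median_smooth(track), low, high), low, high)

# An invented track with one spurious spike followed by a genuine step.
track = [0, 0, 0, 9, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5]
conditioned = condition(track, low=1.0, high=4.0)
```

The median smoother removes the isolated spike, and clipping saturates both
plateaus so that only the transition between them carries variation.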

Every  5  ms,  an  M-dimensional  feature  vector  is  obtained.  All  the 


Figure 3: Block diagram of the pattern classification machine.
Superscripts denote number of samples.


The basic structure of the system that we have developed is shown
in Figure 2. The speech signal is digitized at 16 kHz and captured
by an automatic end-point detection algorithm [5]. From the speech
signal, a number of parameters are computed. These parameters are
then used in conjunction with a pattern classifier to produce 5 broad
phonetic classes. The output of the classifier is used to time-align major
and robust acoustic events with the phonetic transcription. This initial
time alignment serves as anchor points for subsequent detailed phonetic
alignment utilizing a set of heuristic rules [7].


Figure 2: Basic system structure
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feature  vectors  for  a  given  utterance  constitute  samples  in  the  feature 
space. A binary decision is made in the M-dimensional feature space
by using K-Means clustering, with a Euclidean distance metric [10]. It
is well known that the location of the cluster centroids and the speed
of convergence for a clustering algorithm depend on the choice of the
initial seed points, the number of clusters, and the geometrical
distributions of the data. The use of a binary classifier has the advantage
that the algorithm always converges. In addition, the binary classifier
enables us to apply our acoustic-phonetic knowledge and select initial
seed points at the appropriate extrema of the feature space to maximize
the contrast. We found that the algorithm typically converges after fewer
than 10 iterations.
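A minimal sketch of this two-class clustering step is given below, with the
seed points passed in explicitly so that they can be placed at opposite
extremes of the feature space. The function name, the iteration cap, and the
toy two-dimensional frames are assumptions for illustration.

```python
import math

def two_class_kmeans(vectors, seed_a, seed_b, max_iter=10):
    """K-Means with K = 2, Euclidean distance, and caller-supplied seeds."""
    centroids = [list(seed_a), list(seed_b)]
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if math.dist(v, centroids[0]) <= math.dist(v, centroids[1])
                      else 1 for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Invented, already-normalized feature vectors (M = 2) for six frames.
frames = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (0.9, 1.0), (1.0, 0.8), (0.8, 0.9)]
labels, centroids = two_class_kmeans(frames, seed_a=(0.0, 0.0), seed_b=(1.0, 1.0))
```

Seeding at the extremes fixes which cluster index corresponds to which
phonetic class, so the output labels need no later re-identification.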

At the top of the decision tree, the clustering algorithm assigns one
of two labels to every frame of data. Each class of data will pass through
a different classifier at a lower node, and the process repeats. Our
experience has shown that the broad phonetic classifier performs very well
if the total number of classes is small, say 5 or 6. The performance of
the classifier degrades substantially when one attempts to use it to make
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fine phonetic distinctions. In our implementation, the classifier assigns
one of five labels to every frame of the data: S (vowel-like sonorant), O
(obstruent), - (silence), B (nasals and voice-bars), and D (voiced
consonants). Although rarely needed, a context-dependent median smoother
is provided to remove spurious segments.

For the example shown in Figure 1, the output of the broad phonetic
classifier is shown in row (b). Comparing the results with the
spectrogram, we see that important and robust acoustic regions in this example
have been found by the classifier.

Alignment 

The output of the initial classifier is a broad, but presumably robust,
description of the significant acoustic-phonetic events in the speech
signal. In order to use this broad phonetic description as anchor points for
more detailed analyses, the broad representation must now be aligned
with the phonetic transcription. This is essentially a path finding
problem, and we have chosen to use branch and bound search, where the
path is heavily constrained by acoustic-phonetic rules [12]. Figure 4
illustrates how this is done for the same utterance as shown previously.
The horizontal dimension represents the output of the classifier, while
the vertical dimension represents the actual transcription. Durational
information is used by the algorithm, but is not explicitly represented in
this figure. Two kinds of constraints direct the algorithm to search for
the correct path. First, the path is not allowed to traverse through
certain cells, since this will produce implausible phonetic alignments. These
mismatches are stored as a set of context-independent rules, and the
resulting cells are marked in the figure by an x. For example, the first
phoneme /t/ is not allowed to match a silence or a sonorant segment.
Second, there is a set of rules that eliminates certain matches based on
contextual information. The cells eliminated by the context-dependent
rules are represented in the figure by the unfilled rectangles. For
example, the first /l/ is not allowed to match the second obstruent due to a
durational constraint. The remaining permissible cells are marked in the
figure with filled or unfilled circles. The filled circles denote the cheapest
path, subject to a set of cost functions obtained through training. As
can be seen from this example, the acoustic-phonetic constraints can
often reduce the number of permissible paths dramatically.
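The search can be sketched as a small branch and bound over a grid of
transcription symbols against broad-class segments. Everything specific in the
sketch is an assumption: the compatibility table stands in for the
context-independent rules, the unit step costs stand in for the trained cost
functions, and durational and context-dependent constraints are omitted.

```python
def align(phones, segments, allowed):
    """Return the cheapest monotone path of (phone_index, segment_index) cells."""
    goal = (len(phones) - 1, len(segments) - 1)
    best = {"cost": float("inf"), "path": None}

    def permitted(p, s):
        # A cell is forbidden when the segment class cannot realize the phone.
        return segments[s] in allowed[phones[p]]

    def extend(path, cost):
        if cost >= best["cost"]:
            return  # bound: this partial path cannot beat the best one found
        p, s = path[-1]
        if (p, s) == goal:
            best["cost"], best["path"] = cost, list(path)
            return
        # Branch: advance both (cost 0), stay on the phone (1), stay on the segment (1).
        for dp, ds, step_cost in ((1, 1, 0), (0, 1, 1), (1, 0, 1)):
            nxt_p, nxt_s = p + dp, s + ds
            if nxt_p <= goal[0] and nxt_s <= goal[1] and permitted(nxt_p, nxt_s):
                path.append((nxt_p, nxt_s))
                extend(path, cost + step_cost)
                path.pop()

    if permitted(0, 0):
        extend([(0, 0)], 0)
    return best["path"], best["cost"]

# Hypothetical example for the word "sheet": fricative, vowel, closure, burst.
allowed = {"sh": {"O"}, "iy": {"S"}, "t": {"-", "O"}}
path, cost = align(["sh", "iy", "t"], ["O", "S", "-", "O"], allowed)
```

In this toy case the forbidden cells leave a single admissible path, which
mirrors the observation that the constraints sharply reduce the number of
permissible paths.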


cases, there is only one phonetic event in a segment. In other cases, there
can be as many as four phonetic events. Furthermore, a comparison with
the hand transcription shows that all the phonetic events for this example
are mapped to the correct broad class segment.

Knowledge-based  Segmentation 

The knowledge-based path finding algorithm divides the speech
waveform into a sequence of segments. Each segment may be mapped to one
or more phonetic events. No further processing is necessary if the
matching is one or more segments to one phonetic event. For those segments
which correspond to 2 or more phonetic events, more detailed
segmentation is needed. This is done in two separate steps. Certain transitions
between phonetic events, such as the transition between vowels and
postvocalic /r/’s, are not marked by reliable acoustic cues. In these cases, we
have chosen to mark the boundary by a set of arbitrary but consistent
rules. On the other hand, the transitions between some other phonetic
events are more distinct. In these cases, further segmentation is achieved
by the proper selection of feature parameters and algorithms based on
contextual information. An example for this kind of phonetic segment
is the intervocalic /8/.

Row (d) of Figure 1 shows the results of the knowledge-based
segmentation. We see that all the phonetic events in this example have been
correctly located.

Application  of  Speech  Knowledge 

Throughout the development of the system, it was found that
existing algorithms can be made more powerful by judicious application
of speech knowledge. By structuring the initial classifiers in a binary
decision tree, for example, specific acoustic attributes can be selected
to enhance a particular phonetic contrast. Figure 5 shows a typical
comparison of this procedure with an LPC-based classifier using
Itakura's distance metric (4). In this figure, the output of the 4-way
classifier has been converted to a level-coded waveform to facilitate visual
comparison. The “feature-based” classifier consistently outperforms the
“LPC-based” classifier in that the resulting classes are both stable and
phonetically meaningful. We conclude that proper selection of acoustic
measurements for classification has its payoffs.
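The idea of a binary decision tree in which each node tests one acoustic attribute can be illustrated with a minimal sketch. The four class names, the three attributes, and every threshold below are assumptions chosen for illustration; the paper does not list the attributes or values actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """Hypothetical per-frame acoustic attributes (not taken from the paper)."""
    total_energy_db: float   # overall energy of the frame
    low_freq_ratio: float    # fraction of energy at low frequencies, 0..1
    zero_cross_rate: float   # zero crossings per millisecond


def classify(frame: Frame) -> str:
    """Walk a binary decision tree; each node targets one phonetic contrast."""
    # Node 1: speech versus silence, decided on energy alone.
    if frame.total_energy_db < 30.0:
        return "silence"
    # Node 2: noisy (obstruent) versus periodic sounds.
    if frame.zero_cross_rate > 3.0:
        return "obstruent"
    # Node 3: sonorant consonant versus vowel.
    if frame.low_freq_ratio > 0.8:
        return "sonorant"
    return "vowel"


# Level coding: one integer level per class, so the classifier output
# can be drawn as a step waveform for visual comparison.
LEVELS = {"silence": 0, "obstruent": 1, "sonorant": 2, "vowel": 3}

frames: List[Frame] = [
    Frame(20.0, 0.5, 1.0),
    Frame(50.0, 0.2, 5.0),
    Frame(55.0, 0.9, 1.0),
    Frame(65.0, 0.5, 1.0),
]
level_coded = [LEVELS[classify(f)] for f in frames]
```

Because each node looks at a single attribute, a contrast that is poorly separated can be retuned by changing one test without disturbing the rest of the tree.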
Figure 4: Alignment mechanism for the same phrase, “Glue the sheet to
the dark..”

Figure 5: Performance comparison between clusterings based on LPC
coefficients and pre-selected features


The path finding algorithm used in this system is not unlike the
popular dynamic programming algorithm [9]. In this case, however, the
alignment is performed at the segmental level, so that phonetic rules
can be used to severely constrain the search space. In the example of
Figure 4, the path alternatives, shown as open circles, are so limited that
finding the correct alignment is often a trivial operation.
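A segment-level alignment of this kind can be sketched as a small dynamic programming search in which a rule table removes most grid cells before any path is scored. This is a minimal sketch under stated assumptions, not the system's algorithm: the rule table `ALLOWED`, the broad class names, and the unit penalty for non-diagonal moves are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical phonetic rules: broad classes each phone may appear as.
ALLOWED: Dict[str, Set[str]] = {
    "g": {"silence", "obstruent"},
    "l": {"sonorant"},
    "uw": {"vowel"},
}


def align(segments: List[str], phones: List[str],
          allowed: Dict[str, Set[str]]) -> Optional[List[Tuple[int, int]]]:
    """Return the cheapest monotonic path of (segment, phone) index pairs.

    A cell is admissible only if the segment's broad class is permitted
    for the phone. Diagonal moves are free; a phone spanning an extra
    segment, or a segment absorbing an extra phone, costs 1.
    """
    n, m = len(segments), len(phones)
    inf = float("inf")

    def ok(i: int, j: int) -> bool:
        return segments[i] in allowed[phones[j]]

    if n == 0 or m == 0 or not ok(0, 0):
        return None
    cost = [[inf] * m for _ in range(n)]
    back: List[List[Optional[Tuple[int, int]]]] = [[None] * m for _ in range(n)]
    cost[0][0] = 0
    for i in range(n):
        for j in range(m):
            if (i, j) == (0, 0) or not ok(i, j):
                continue  # rule table prunes this cell from the search
            moves = []
            if i > 0 and j > 0:
                moves.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                moves.append((cost[i - 1][j] + 1, (i - 1, j)))
            if j > 0:
                moves.append((cost[i][j - 1] + 1, (i, j - 1)))
            best = min(moves, key=lambda mv: mv[0])
            if best[0] < inf:
                cost[i][j], back[i][j] = best
    if cost[n - 1][m - 1] == inf:
        return None
    path, cell = [], (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    return path[::-1]


path = align(["silence", "obstruent", "sonorant", "vowel"],
             ["g", "l", "uw"], ALLOWED)
```

In this toy case only four of the twelve grid cells are admissible, so a single path survives, which mirrors the observation that the constrained search is often trivial.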


Row (c) of Figure 1 shows the results of the time alignment obtained
from the complete path shown in Figure 4. It can be seen that in some


EVALUATION 


Evaluation  1 

The system was first evaluated using 40 distinct sentences, randomly
chosen from the Harvard List of phonetically balanced sentences. Five
talkers, three male and two female, each read twenty sentences, resulting
in a total of one hundred (100) sentences. The corpus contains
approximately 4 minutes of speech material and three thousand (3000) phonetic
events. All sentences were hand transcribed by an experienced acoustic
phonetician. For comparison, five of the one hundred sentences, selected
at random, were manually labeled by a second transcriber. The entire
process of manual labeling took upwards of 25 hours.

Figure  6  shows  the  percentage  of  phonetic  events  located  after  two 
separate  stages  of  processing.  A  phonetic  event  is  located  when  there  is 
a  correspondence  between  it  and  one  or  more  acoustic  segments.  We  see 
that  approximately  80%  of  the  phonetic  events  have  been  located  after 
the  alignment  procedure.  This  number  increased  to  01%  after  knowledge- 
based  segmentation. 
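The "located" statistic can be computed from an alignment expressed as (segment, event) index pairs. The following is a minimal sketch that assumes an event counts as located when no segment it touches is shared with another event; this reading of the definition and the toy data are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def located_fraction(pairs: Iterable[Tuple[int, int]]) -> float:
    """Fraction of phonetic events that do not share a segment with another event."""
    pairs = list(pairs)
    events_per_segment: Dict[int, Set[int]] = defaultdict(set)
    for segment, event in pairs:
        events_per_segment[segment].add(event)
    all_events = {event for _, event in pairs}
    # Events that sit in a segment together with at least one other event.
    shared = {event
              for events in events_per_segment.values() if len(events) > 1
              for event in events}
    return (len(all_events) - len(shared)) / len(all_events)


# Toy alignment: event 0 spans segments 0 and 1, event 1 has segment 2,
# and events 2 and 3 are both inside segment 3.
toy_pairs = [(0, 0), (1, 0), (2, 1), (3, 2), (3, 3)]
fraction = located_fraction(toy_pairs)
```

Here two of the four events are located, so `fraction` is 0.5; further segmentation of segment 3 would raise it.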
Figure 6: Statistics on number of phonetic events in one segment after
two different stages of processing.


We have also compared the reliability of the boundaries found by the
system with those found by an experienced transcriber. This is done by
computing the absolute difference between the two sets of boundaries.
Figure 7(a) shows the cumulative distribution of the boundary offsets
between the automatic alignment system and the acoustic phonetician.
We see that approximately 75% of the boundaries are within 10 msec
of each other, and over 90% of the boundaries are within 20 msec.
Figure 7(b) shows the boundary offsets between the two transcribers for five
of the sentences. Since it is difficult to assert exactly where a boundary
should be, this curve provides a subjective indication of the performance
of the alignment system. We see that the system-transcriber differences
are similar in magnitude to the inter-transcriber differences.
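The boundary comparison reduces to absolute offsets and a cumulative fraction at each tolerance. The sketch below assumes the two boundary lists are already paired one-to-one and expressed in milliseconds; the sample values are invented for illustration and are not data from the evaluation.

```python
from typing import List, Sequence


def abs_offsets_ms(auto: Sequence[int], hand: Sequence[int]) -> List[int]:
    """Absolute difference between paired automatic and hand boundaries."""
    if len(auto) != len(hand):
        raise ValueError("boundary lists must be paired one-to-one")
    return [abs(a - h) for a, h in zip(auto, hand)]


def fraction_within(offsets: Sequence[int], tolerance_ms: int) -> float:
    """One point of the cumulative distribution: share of offsets <= tolerance."""
    return sum(1 for d in offsets if d <= tolerance_ms) / len(offsets)


# Invented boundary times in milliseconds.
auto_boundaries = [102, 250, 431, 600]
hand_boundaries = [100, 262, 430, 625]

offsets = abs_offsets_ms(auto_boundaries, hand_boundaries)
within_10 = fraction_within(offsets, 10)
within_20 = fraction_within(offsets, 20)
```

Evaluating `fraction_within` over a range of tolerances gives the kind of curve plotted in Figure 7, and running it on two transcribers' boundaries gives the inter-transcriber reference.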
Figure 7: Cumulative distributions of the boundary offset


Evaluation 2

The system was also evaluated on 300 distinct sentences, randomly
chosen from the Harvard List. Sixty speakers, thirty male and thirty
female, each read ten of the sentences, resulting in a total of 600 sentences,
and approximately 25 minutes of speech material.

Instead of asking a human transcriber to label these sentences for
system evaluation, the output of the alignment system was checked by
the same acoustic phonetician as in Evaluation 1. Misalignments, when
present, were then corrected by hand. Figure 7(c) shows the boundary
offsets between the sets of boundaries before and after hand correction.
We can see that 76% of the boundaries do not need to be corrected,
whereas over 80% of the boundaries are within 10 msec. In other words,
the performance results for the two databases were quite similar.

SUMMARY 

In this paper we described a system that automatically aligns a
phonetic transcription with the corresponding speech waveform. The system
performs initial classification by a pattern classification algorithm. The
output of the classifier is used to determine “islands of reliability” for
further segmentation. By proper application of speech knowledge, the
performance of the system can be improved. The entire system runs
in approximately 35 times real time on our Lisp machine workstations.
In addition, supporting software for entering transcriptions and
correcting alignment is also provided. We have now used the system for the
collection of over 1000 sentences. Some of the output of the alignment
system has already been used for different research projects (2). We are
encouraged by the results, and are hopeful that this system will play a
major role in establishing a large database for speech research.

[This  research  was  supported  by  the  Office  of  Naval  Research  under 
contract  N00014-82-K-0727] 